Abstract

Quantile regression is a technique to estimate conditional quantile curves. It pro- 
vides a comprehensive picture of a response contingent on explanatory variables. In 
a flexible modeling framework, a specific form of the conditional quantile curve is not 
a priori fixed. This motivates a local parametric rather than a global fixed model 
fitting approach. A nonparametric smoothing estimator of the conditional quantile
curve requires balancing local curvature against stochastic variability. In this
paper, we suggest a local model selection technique that provides an adaptive
estimator of the conditional quantile regression curve at each design point. Theoretical
results claim that the proposed adaptive procedure performs as well as an oracle
which would minimize the local estimation risk for the problem at hand. We illustrate
the performance of the procedure by an extensive simulation study and consider a
couple of applications: to tail dependence analysis for the Hong Kong stock market
and to analysis of the distributions of the risk factors of temperature dynamics.

Figure 1: The bandwidth sequence (upper panel), plot of data and the estimated 90%
quantile curve (lower panel)



1 Introduction 

Quantile regression is gradually developing into a comprehensive approach for the statis- 
tical analysis of linear and nonlinear response models. Since the rigorous treatment of 
linear quantile regression by Koenker and Bassett (1978), richer models have been intro- 
duced into the literature, among them are nonparametric, semiparametric and additive 
approaches. Quantile regression or conditional quantile estimation is a crucial element 
of analysis in many quantitative problems. In financial risk management, the proper 
definition of quantile based Value at Risk impacts asset pricing, portfolio hedging and 
investment evaluation, Engle and Manganelli (2004), Cai and Wang (2008) and Fitzen- 
berger and Wilke (2006). In labor market analysis of wage distributions, education effects 
and earning inequalities are analyzed via quantile regression. Other applications of condi- 
tional quantile studies include, for example, conditional data analysis of children growth 
and ecology, where it accounts for the unequal variations of response variables, see James 
et al. (2010). 

In applications, the predominantly used linear form of the calibrated models is mainly
determined by practical and numerical considerations. Many efficient algorithms
(such as sparse linear algebra and interior point methods) are available; see Portnoy and
Koenker (1989), Portnoy and Koenker (1997), Koenker and Ferreira (1999), and Koenker (2005).
However, the assumption of a linear parametric structure can be too restrictive in
many applications. This observation spawned a stream of literature on nonparametric
modeling of quantile regression; see Yu and Jones (1998), Fan et al. (1994), etc. One line of
thought concentrated on different smoothing techniques, e.g. splines, kernel smoothing,
etc.; see Fan and Gijbels (1996). Another line of literature considers structural
semiparametric models to cope with the curse of dimensionality, such as partial linear
models, Härdle et al. (2012); additive models, Kong et al. (2010), Horowitz and Lee (2005);
and single index models, Wu et al. (2010), Koenker (2010). Yet another strand of literature
deals with ultra-high dimensional situations where a careful variable selection
technique needs to be implemented, Belloni and Chernozhukov (2010) and Koenker (2010).
In most of the aforementioned papers on non- and semiparametric quantile regression,
smoothing parameter selection is implicit: it is mostly a consequence of theoretical
assumptions such as the degree of smoothness, and offers few practical hints for real
data applications. Important exceptions are the methods for local nonparametric kernel
smoothing by Yu and Jones (1998) and Cai and Xu (2008). They both propose a
data-driven bandwidth choice.

This paper offers a novel data-driven quantile regression procedure. Its numerical 
performance is illustrated by competitive simulation examples and applications to real 
data. The proposed adaptive local quantile regression algorithm is easy to implement and 
works for a wide class of applications. The idea of this algorithm is to select the bandwidth 
locally by a sequence of likelihood ratio tests. We also provide a rigorous theoretical study 
for the proposed method. The optimality results are stated as exact and sharp oracle
risk bounds. In particular, we show that the performance of the adaptive procedure is
essentially the same as the best possible one. The results apply to finite samples and under
mild regularity conditions.

The main message is that the proposed algorithm is spatially adaptive, stable in
homogeneous situations and sensitive to structural changes of the quantile curve. This conclusion
is justified by theoretical results and confirmed by the numerical study. As an example,
consider Figure 1, which presents our results for the Lidar data set, Ruppert
et al. (2003). The presented quantile curve switches smoothness in the middle, and this is
naturally reflected by the selected bandwidth sequence (upper panel). Where the slope of the
curve becomes steeper, the bandwidths get smaller to attain better approximations.
This example shows that the algorithm proposed in this paper can adaptively
choose the bandwidth at each design point.

This article is organized as follows. Section 2 introduces the local model selection
(LMS) procedure and explains how the important tuning parameters (critical values) can
be computed. Section 3 presents a number of Monte Carlo simulations to illustrate the
proposed methodology. In Section 4 the method is applied to check the tail dependency
among portfolio stocks, and to estimate quantile curves for temperature risk factors. Section 5
presents our main theoretical result, which states a kind of oracle risk bound for the
proposed procedure: it performs nearly as well as the best one among the considered
family of local quantile estimators. The necessary conditions and the main steps of the proof,
like the "propagation", "stability" and "oracle" properties, are relegated to the Appendix. There
we also collect some general results like majorization bounds and a non-asymptotic Wilks
Theorem for the likelihood ratio test statistics.

2 Adaptive estimation procedure 

This section introduces the considered problem and offers an adaptive estimation proce- 
dure. 

2.1 Quantile regression model 

Given the quantile level $\tau \in (0, 1)$, the quantile regression model describes the following
relation between the response $Y$ and the regressor $X$:

$P\{Y > f(x) \mid X = x\} = \tau,$

where $f(x)$ is the unknown quantile regression function. This function is the target of the
analysis and it has to be estimated from independent observations $\{X_i, Y_i\}_{i=1}^n$. For the
case of a deterministic design, this quantile relation can be represented as

$Y_i = f(X_i) + \varepsilon_i,$ (1)

where the errors $\varepsilon_i$ follow $P(\varepsilon_i > 0) = \tau$.

For simplicity of presentation, we consider a univariate regressor $X \in \mathbb{R}^1$ and a
deterministic design in this paper; an extension to the $d$-dimensional case $X \in \mathbb{R}^d$ with
$d > 1$ is straightforward.

2.2 A qMLE View on Quantile Estimation 

The quantile function $f(\cdot)$ in (1) is usually recovered by minimizing the sum

$\sum_{i=1}^n \rho_\tau\{Y_i - f(X_i)\},$ (2)

over the class of all considered quantile functions $f(\cdot)$, where

$\rho_\tau(u) \stackrel{\mathrm{def}}{=} u\{\tau\,\mathbb{1}(u > 0) - (1 - \tau)\,\mathbb{1}(u < 0)\} = u\{\tau - \mathbb{1}(u < 0)\}.$

Such an approach is reasonable because the true quantile function $f(x)$ minimizes the
expected value of the sum in (2). An important special case is given by $\tau = 1/2$. Then
an estimator of $f(\cdot)$ is built as the minimizer of the least absolute deviations (LAD) contrast
$\sum_{i=1}^n |Y_i - f(X_i)|$.

The minimum contrast approach based on minimization of (2) can also be put in a
quasi maximum likelihood framework. Assume that the residuals $\varepsilon_i$ from (1) are i.i.d.
and $\ell(x)$ is their log-density on $\mathbb{R}^1$. Then the joint log-density is given by the
sum

$\sum_{i=1}^n \ell\{Y_i - f(X_i)\}$

and its maximization is equivalent to minimization of the contrast (2) when $\ell$ is the
log-density of the asymmetric Laplace distribution ALD$_\tau$:

$\ell(u) = \ell_\tau(u) = \log\{\tau(1 - \tau)\} - \rho_\tau(u), \quad -\infty < u < \infty.$ (3)
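
As a small numerical illustration of the link between the contrast (2) and the ALD
log-density (3), the following Python sketch evaluates both criteria for a constant fit on a
handful of made-up responses. The data, the function names and the restriction to a constant
model are assumptions made purely for illustration.

```python
import math

def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - 1{u < 0}) from (2)."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def contrast(theta, ys, tau):
    """Contrast (2) for a constant fit f(x) = theta."""
    return sum(check_loss(y - theta, tau) for y in ys)

def ald_log_likelihood(theta, ys, tau):
    """Sum of ALD_tau log-densities (3) for the constant fit."""
    return sum(math.log(tau * (1.0 - tau)) - check_loss(y - theta, tau) for y in ys)

# Toy responses, purely illustrative.
ys = [0.3, -1.2, 2.5, 0.9, 1.7, -0.4, 3.1, 0.0, 1.1]
tau = 0.75

# The contrast is convex and piecewise linear in theta with kinks at the
# observations, so it suffices to search over the observed values.
theta_min = min(ys, key=lambda t: contrast(t, ys, tau))
theta_max = max(ys, key=lambda t: ald_log_likelihood(t, ys, tau))

print(theta_min, theta_max)
```

On these nine values both searches return 1.7, the seventh smallest observation. The
minimizer of the contrast and the maximizer of the ALD log-likelihood coincide because the
two criteria differ only by a sign and the constant $n \log\{\tau(1 - \tau)\}$.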

The parametric approach (PA) assumes that the quantile regression function $f(\cdot)$ belongs
to a family of functions $\{f_\theta(x), \theta \in \Theta\}$, where $\Theta$ is a subset of the $(p + 1)$-dimensional
Euclidean space. Equivalently,

$f(x) = f_{\theta^*}(x),$

where $\theta^*$ is the true parameter which is usually the target of estimation.
Examples are a constant model:

$f_{\theta^*}(x) = \theta_0,$

with $\theta^* = \theta_0$, or a linear model:

$f_{\theta^*}(x) = \theta_0 + \theta_1 x,$

with $\theta^* = (\theta_0, \theta_1)^\top$.

Let $P_\theta$ be the parametric measure on the observation space which corresponds to
the regression model (1) with $f(\cdot) = f_\theta(\cdot)$ and with the i.i.d. errors following the
asymmetric Laplace distribution (3). Then the log-likelihood $L(\theta) = L(Y, \theta)$ for $P_\theta$
can be written as

$L(\theta) \stackrel{\mathrm{def}}{=} n \log\{\tau(1 - \tau)\} - \sum_{i=1}^n \rho_\tau\{Y_i - f_\theta(X_i)\}$ (4)

and the qMLE maximizes $L(\theta)$, or, equivalently, minimizes the contrast
$\sum_{i=1}^n \rho_\tau\{Y_i - f_\theta(X_i)\}$ over all $\theta \in \Theta$.

The described parametric construction is based on two assumptions: one is about the
error distribution (3) and the other one is about the shape of the regression function $f$.
However, it is only used for motivating our approach. Our theoretical study will be done
under the true data distribution which follows (1) under mild regularity conditions. The
next section explains how a smooth regression function $f$ can be modeled by a flexible
local parametric assumption.

2.3 Local polynomial qMLE 

This section explains how the restrictive global PA $f(\cdot) = f_{\theta^*}(\cdot)$ can be relaxed by using
a local parametric approach. Let a point $x$ be fixed. The local PA at a point $x \in \mathbb{R}$ only
requires that the quantile regression function $f(\cdot)$ can be approximated by a parametric
function $f_\theta(\cdot)$ from the given family in a vicinity of $x$. Below we fix a family of polynomial
functions of degree $p$ motivated by Taylor approximation:

$f(u) \approx f_\theta(u) = \theta_0 + \theta_1 (u - x) + \ldots + \theta_p (u - x)^p / p!$ (5)

for $\theta = (\theta_0, \ldots, \theta_p)^\top$. The corresponding parametric model can be written as

$Y_i = \Psi_i^\top \theta + \varepsilon_i,$ (6)

where $\Psi_i = \{1, (X_i - x), (X_i - x)^2/2!, \ldots, (X_i - x)^p/p!\}^\top \in \mathbb{R}^{p+1}$.

A local likelihood approach at $x$ is specified by a localizing scheme $W$ given by a
collection of weights $w_i$ for $i = 1, \ldots, n$. The weights $w_i$ vanish for points $X_i$ lying
outside a vicinity of the point $x$. A standard proposal for choosing the weights $W$ is
$w_i = K_{\mathrm{loc}}\{(X_i - x)/h\}$, where $K_{\mathrm{loc}}(\cdot)$ is a kernel function with a compact support, while
$h$ is a bandwidth controlling the degree of localization.

Define now the local log-likelihood at $x$ by

$L(W, \theta) \stackrel{\mathrm{def}}{=} \log\{\tau(1 - \tau)\} \sum_{i=1}^n w_i - \sum_{i=1}^n \rho_\tau(Y_i - \Psi_i^\top \theta)\, w_i.$ (7)

This expression is similar to the global log-likelihood in (4), but each summand in $L(W, \theta)$
is multiplied with the weight $w_i$, so only the points from the local vicinity of $x$ contribute
to $L(W, \theta)$. Note that this local log-likelihood depends on the central point $x$ via the
structure of the basis vectors $\Psi_i$ and via the weights $w_i$. The corresponding local qMLE
at $x$ is defined via maximization of $L(W, \theta)$:

$\tilde\theta(x) = \{\tilde\theta_0(x), \tilde\theta_1(x), \ldots, \tilde\theta_p(x)\}^\top$ (8)
$\stackrel{\mathrm{def}}{=} \arg\max_{\theta \in \Theta} L(W, \theta)$
$= \arg\min_{\theta \in \Theta} \sum_{i=1}^n \rho_\tau(Y_i - \Psi_i^\top \theta)\, w_i.$

The first component $\tilde\theta_0(x)$ provides an estimator of $f(x)$, while $\tilde\theta_m(x)$ is an estimator of
the derivative $f^{(m)}(x)$, $m = 1, \ldots, p$.
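
For the local constant case $p = 0$ the minimization in (8) has a closed form, which allows a
compact illustration. The Python sketch below is not the implementation used in this paper: it
assumes an Epanechnikov kernel for $K_{\mathrm{loc}}$, restricts to $p = 0$, and uses hypothetical
function names. For $p \ge 1$ the minimization is a linear program and needs a dedicated solver.

```python
def epanechnikov(t):
    """Compactly supported kernel; one common choice, assumed here."""
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0

def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def local_weights(xs, x, h):
    """Weights w_i = K_loc{(X_i - x) / h}."""
    return [epanechnikov((xi - x) / h) for xi in xs]

def local_contrast(theta, ys, ws, tau):
    """Weighted contrast sum_i rho_tau(Y_i - theta) w_i for p = 0."""
    return sum(w * check_loss(y - theta, tau) for y, w in zip(ys, ws))

def local_constant_qmle(xs, ys, x, h, tau):
    """Minimize the weighted contrast over theta (case p = 0).

    The objective is convex and piecewise linear, so a minimizer is the
    weighted tau-quantile: the smallest Y_i whose cumulative weight reaches
    tau times the total weight.
    """
    ws = local_weights(xs, x, h)
    pairs = sorted((y, w) for y, w in zip(ys, ws) if w > 0.0)
    if not pairs:
        raise ValueError("no observations inside the window")
    total = sum(w for _, w in pairs)
    acc = 0.0
    for y, w in pairs:
        acc += w
        if acc >= tau * total:
            return y
    return pairs[-1][0]  # only reached through rounding in the comparison

# Noise-free toy curve on an equispaced design, purely illustrative.
xs = [i / 50 for i in range(51)]
ys = [2.0 * x * (1.0 + x) for x in xs]
est = local_constant_qmle(xs, ys, x=0.5, h=0.1, tau=0.75)
```

The weighted quantile is used instead of a generic optimizer because it returns an exact
minimizer of the piecewise linear objective without any tuning.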
2.4 Selection of a Pointwise Bandwidth

The choice of bandwidth $h$ is an important issue in implementing (8). One can reduce the
variance of the estimation by increasing the bandwidth, but at the price of possibly inducing
more modeling bias measured by the accuracy of approximation in (5); see Figure 2.

A desirable choice of a bandwidth at a fixed point would strike a balance between
the variance and the bias depending on the local shape of $f(\cdot)$ in the vicinity of $x$.
Many approaches have been proposed along this line; see e.g. Yu and Jones (1998) and
references therein. However, their justification and implementation are based on asymptotic
arguments and require large samples. Here we propose a pointwise bandwidth selection
technique based on a finite sample theory.

Our basic setup of the algorithm is described as follows. First one fixes a finite ordered
set of possible bandwidths $h_1 < h_2 < \ldots < h_K$, where $h_1$ is very small, while $h_K$
should be a global bandwidth of the order of the design range. The bandwidth sequence
can be taken geometrically increasing of the form $h_k = a b^k$ with fixed $a > 0$, $b > 1$,
and $n^{-1} < a b^k < 1$ for $k = 1, \ldots, K$ (A.2). The total number $K$ of the candidate
bandwidths is then at most logarithmic in the sample size $n$. For each $k \le K$, an ordered
weighting scheme $W^{(k)} = (w_1^{(k)}, w_2^{(k)}, \ldots, w_n^{(k)})^\top$ is defined via
$w_i^{(k)} = K_{\mathrm{loc}}\{(x - X_i)/h_k\}$.
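
The geometric bandwidth grid described above can be generated in a few lines. This is a
minimal sketch: the function name and the values of $a$ and $b$ in the example are
hypothetical and are not the ones used in the simulations of this paper.

```python
def bandwidth_grid(n, a, b):
    """Geometric grid h_k = a * b**k, k = 1, 2, ..., kept while 1/n < h_k < 1."""
    if not (a > 0 and b > 1):
        raise ValueError("need a > 0 and b > 1")
    grid = []
    k = 1
    while a * b ** k < 1.0:
        h = a * b ** k
        if h > 1.0 / n:          # drop bandwidths below the 1/n resolution
            grid.append(h)
        k += 1
    return grid

# Illustrative values only.
grid = bandwidth_grid(n=1000, a=0.005, b=1.25)
```

Because consecutive bandwidths differ by the fixed factor $b$ and all of them lie between
$1/n$ and 1, the grid cannot contain more than about $\log n / \log b$ entries, which is the
logarithmic growth of $K$ mentioned in the text.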



The proposed selection procedure is similar in spirit to Lepski et al. (1997). If the underlying
quantile regression function is smooth, one can expect a good quality of approximation
(5) for a large bandwidth among $\{h_k\}_{k=1}^K$. Moreover, if the approximation is good for
one bandwidth, it will also be suitable for all smaller bandwidths. So, if we observe a
significant difference between the estimator $\tilde\theta_k(x)$ corresponding to the bandwidth $h_k$
and an estimator $\tilde\theta_\ell(x)$ corresponding to a smaller bandwidth $h_\ell$, this is an indication
that the approximation (5) for the window size $h_k$ becomes too rough. This justifies the
following algorithm. Start with the smallest bandwidth $h_1$. For any $k > 1$, compute
the local qMLE $\tilde\theta_k(x)$ and check whether it is consistent with all the previous estimators
$\tilde\theta_\ell(x)$ for $\ell < k$. If the consistency check is negative, the procedure terminates and selects
the latest accepted estimator.

The most important ingredient of the method is the consistency check. The Lepski
method suggests using the difference $\tilde\theta_k(x) - \tilde\theta_\ell(x)$ as a test statistic; see e.g. Lepski
et al. (1997). We follow the suggestion from Polzehl and Spokoiny (2006) and apply
a localized likelihood ratio type test. Each weighting scheme $W^{(k)}$ leads to a local
quantile estimator $\tilde\theta_k(x)$ with

$\tilde\theta_k(x) = \arg\max_{\theta \in \Theta} L(W^{(k)}, \theta).$ (9)

More precisely, the local MLE $\tilde\theta_\ell(x)$ maximizes
the log-likelihood $L(W^{(\ell)}, \theta)$, and the maximal value of (7) given by
$\sup_{\theta} L(W^{(\ell)}, \theta) = L(W^{(\ell)}, \tilde\theta_\ell(x))$ is compared with the particular
log-likelihood value $L(W^{(\ell)}, \tilde\theta_k(x))$, where
the estimator $\tilde\theta_k(x)$ is obtained by maximizing the other local log-likelihood function
$L(W^{(k)}, \theta)$. The difference $L(W^{(\ell)}, \tilde\theta_\ell(x)) - L(W^{(\ell)}, \tilde\theta_k(x))$ is always non-negative. The
check rejects $\tilde\theta_k(x)$ if this difference is too large for some $\ell < k$. Equivalently, one can
say that the test checks whether $\tilde\theta_k(x)$ belongs to the confidence sets $\mathcal{E}_\ell(\mathfrak{z})$ of $\tilde\theta_\ell(x)$:

$\mathcal{E}_\ell(\mathfrak{z}) = \{\theta \in \Theta : L(W^{(\ell)}, \tilde\theta_\ell(x)) - L(W^{(\ell)}, \theta) \le \mathfrak{z}\}.$

A great advantage of the likelihood ratio test is that the critical value $\mathfrak{z}$ can be selected
universally. This is justified by the Wilks phenomenon: the likelihood ratio test statistic is
nearly $\chi^2$ and its asymptotic distribution depends only on the dimension of the parameter
space. Unfortunately, these arguments do not apply to finite samples under possible model
misspecification, and we therefore offer an alternative way of fixing the critical values $\mathfrak{z}$
which is based on the so-called propagation condition. We also allow the width of the
confidence set $\mathcal{E}_\ell(\mathfrak{z})$ to depend on the index $\ell$, that is, $\mathfrak{z} = \mathfrak{z}_\ell$. Our adaptation algorithm
can be summarized as follows: at each step $k$, an estimator $\hat\theta_k(x)$ is constructed based
on the first $k$ estimators $\tilde\theta_1(x), \ldots, \tilde\theta_k(x)$ by the following rule,

• Start with $\hat\theta_1(x) = \tilde\theta_1(x)$.

• For $k \ge 2$, $\tilde\theta_k(x)$ is accepted and $\hat\theta_k(x) \stackrel{\mathrm{def}}{=} \tilde\theta_k(x)$, if $\tilde\theta_{k-1}(x)$ was accepted and

$L(W^{(\ell)}, \tilde\theta_\ell(x)) - L(W^{(\ell)}, \tilde\theta_k(x)) \le \mathfrak{z}_\ell, \quad \ell = 1, \ldots, k - 1.$ (10)

• Otherwise $\hat\theta_k(x) = \hat\theta_{k-1}(x)$.

The adaptive estimator $\hat\theta(x)$ is the latest accepted estimator after all $K$ steps:

$\hat\theta(x) = \hat\theta_K(x).$

A visualization of the procedure is presented in Figure 2. The critical values $\mathfrak{z}_\ell$ are
selected by an algorithm based on the propagation condition explained in the next section.
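
The acceptance rule (10) can be sketched in Python for the local constant case $p = 0$. The
kernel (Epanechnikov), the function names and, in particular, the critical values in the
example are assumptions: the values of `crit` are arbitrary placeholders, not values
calibrated by the propagation condition.

```python
import math

def kernel(t):
    """Epanechnikov kernel, an assumed choice of K_loc."""
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0

def check_loss(u, tau):
    return u * (tau - (1.0 if u < 0 else 0.0))

def weights(xs, x, h):
    return [kernel((xi - x) / h) for xi in xs]

def local_loglik(theta, ys, ws, tau):
    """Local log-likelihood (7) for the local constant model (p = 0)."""
    c = math.log(tau * (1.0 - tau))
    return sum(w * (c - check_loss(y - theta, tau)) for y, w in zip(ys, ws))

def local_qmle(ys, ws, tau):
    """Weighted tau-quantile, a maximizer of (7) when p = 0."""
    pairs = sorted((y, w) for y, w in zip(ys, ws) if w > 0.0)
    if not pairs:
        raise ValueError("no observations inside the window")
    total = sum(w for _, w in pairs)
    acc = 0.0
    for y, w in pairs:
        acc += w
        if acc >= tau * total:
            return y
    return pairs[-1][0]

def adaptive_estimate(xs, ys, x, bandwidths, crit, tau):
    """Return (estimate, index of the selected bandwidth) following rule (10).

    crit[l] is the critical value attached to bandwidths[l]; it needs
    len(bandwidths) - 1 entries.
    """
    all_ws = [weights(xs, x, h) for h in bandwidths]
    tilde = [local_qmle(ys, ws, tau) for ws in all_ws]
    selected = 0                      # the smallest bandwidth is always accepted
    for k in range(1, len(bandwidths)):
        ok = all(
            local_loglik(tilde[l], ys, all_ws[l], tau)
            - local_loglik(tilde[k], ys, all_ws[l], tau) <= crit[l]
            for l in range(k)
        )
        if not ok:
            break                     # later estimators are never accepted
        selected = k
    return tilde[selected], selected

# Noise-free step function with a jump at 0.5, purely illustrative.
xs = [i / 200 for i in range(201)]
ys = [0.0 if x <= 0.5 else 5.0 for x in xs]
bandwidths = [0.02, 0.05, 0.09, 0.2, 0.4]
crit = [1.0, 1.0, 1.0, 1.0]           # placeholder critical values
est, idx = adaptive_estimate(xs, ys, 0.4, bandwidths, crit, tau=0.9)
```

In this toy example the window of width 0.2 around $x = 0.4$ reaches across the jump at 0.5,
the check against the smallest window fails, and the procedure returns the estimate for the
bandwidth 0.09 (index 2).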

2.5 Parameter Tuning by Propagation Condition 

The practical implementation requires fixing the critical values $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$. We apply
the propagation approach which is an extension of the proposal from Spokoiny (2009);
Spokoiny and Vial (2009). The idea is to tune the parameters of the procedure for one
artificial parametric situation. Later we show that such defined critical values work well in the
general setup and provide a nearly efficient estimation quality. The presented bandwidth
selector can be viewed as a multiple testing procedure. This suggests fixing the critical
values as in the general testing theory by ensuring a prescribed performance under the null
hypothesis. In our case, the null hypothesis corresponds to the pure parametric situation
with $f(\cdot) = f_{\theta^*}(\cdot)$ in equation (1). Moreover, we fix some particular distribution of
the errors $\varepsilon_i$; our specific choice is ALD$_\tau$ with parameter $\tau$. Below in this section we
denote by $P_{\theta^*}$ the data distribution under these assumptions.

Figure 2: Demonstration of the local adaptive algorithm.

For this artificial data generating process, all the estimators $\tilde\theta_k(x)$ should be consistent
with each other and the procedure should not terminate at any intermediate step $k < K$.
This effect is called propagation: in the parametric situation, the degree of locality will
be successfully increased until it reaches the largest scale. The critical values are selected
to ensure the desired propagation condition which effectively means a "no false alarm"
property: the selected adaptive estimator coincides in most cases with the estimator
$\tilde\theta_K(x)$ corresponding to the largest bandwidth. The event $\{\tilde\theta_k(x) \ne \hat\theta_k(x)\}$ for $k \le K$ is
associated with a false alarm and the corresponding loss can be measured by the difference

$L(W^{(k)}, \tilde\theta_k(x), \hat\theta_k(x)) = L(W^{(k)}, \tilde\theta_k(x)) - L(W^{(k)}, \hat\theta_k(x)).$

The propagation condition postulates that the risk induced by such false alarms is smaller
than the upper bound for the risk of the estimator $\tilde\theta_k(x)$ in the pure parametric situation:

$E_{\theta^*} L^r(W^{(k)}, \tilde\theta_k(x), \hat\theta_k(x)) \le \alpha \mathfrak{R}_r, \quad k = 2, \ldots, K,$ (11)

where the constant $\mathfrak{R}_r$ is such that for all $k \le K$, it holds

$E_{\theta^*} L^r(W^{(k)}, \tilde\theta_k(x), \theta^*) \le \mathfrak{R}_r.$

The values $\alpha$ and $r$ in (11) are two hyper-parameters. The role of $\alpha$ is similar to the
significance level of a test, while $r$ denotes the power of the loss function. It is worth
mentioning that

$E_{\theta^*} L^r(W^{(k)}, \tilde\theta_k(x), \hat\theta_k(x)) \to P_{\theta^*}\{\tilde\theta_k(x) \ne \hat\theta_k(x)\}, \quad r \to 0.$

The critical values $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$ enter implicitly in the propagation condition: if the false
alarm event $\{\tilde\theta_k(x) \ne \hat\theta_k(x)\}$ happens too often, it is an indication that some of the
critical values $\mathfrak{z}_1, \ldots, \mathfrak{z}_{k-1}$ are too small. Note that (11) relies on the artificial parametric
model $P_{\theta^*}$ instead of the true model $P$. The point $\theta^*$ here can be selected arbitrarily,
e.g. $\theta^* = 0$. This fact relies on the linear parametric structure of the model (6) and is
justified by the following simple lemma.

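Lemma 1. The distribution of $L(W^{(k)}, \tilde\theta_k(x), \hat\theta_k(x))$ and of $L(W^{(k)}, \tilde\theta_k(x), \theta^*)$ under
$P_{\theta^*}$ does not depend on $\theta^*$.

Proof. Under the PA $f(\cdot) = f_{\theta^*}(\cdot)$, it holds $Y_i - f(X_i) = Y_i - \Psi_i^\top \theta^* = \varepsilon_i$ and hence,

$L(W^{(k)}, \theta) = \log\{\tau(1 - \tau)\} \sum_{i=1}^n w_i^{(k)} - \sum_{i=1}^n \rho_\tau\{\varepsilon_i - \Psi_i^\top(\theta - \theta^*)\}\, w_i^{(k)}.$

A simple inspection of this formula yields that the distribution of $L(W^{(k)}, \theta)$ only depends
on $u = \theta - \theta^*$. In other words, we can use the free parameter $u = \theta - \theta^*$ whatever $\theta^*$
is, e.g. $\theta^* = 0$. The same argument applies to the difference $L(W^{(k)}, \tilde\theta_k(x), \tilde\theta_\ell(x))$ for
$\ell < k$. Moreover, $L(W^{(k)}, \tilde\theta_k(x), \hat\theta_k(x))$ is a function of
$\{L(W^{(k)}, \tilde\theta_k(x), \tilde\theta_\ell(x))\}_{\ell=1}^{k}$, so
the distribution of $L(W^{(k)}, \tilde\theta_k(x), \hat\theta_k(x))$ does not depend on $\theta^*$. □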

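A choice of critical values $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$ can be implemented in the following way:

• Consider first only $\mathfrak{z}_1$ and fix $\mathfrak{z}_2 = \ldots = \mathfrak{z}_{K-1} = \infty$, leading to the estimators
$\hat\theta_k(\mathfrak{z}_1, x)$ for $k = 2, \ldots, K$. The value $\mathfrak{z}_1$ is selected as the minimal one for which

$\frac{1}{\mathfrak{R}_r} E_{\theta^*} L^r(W^{(k)}, \tilde\theta_k(x), \hat\theta_k(\mathfrak{z}_1, x)) \le \frac{\alpha}{K - 1}, \quad k = 2, \ldots, K.$ (12)

• With selected $\mathfrak{z}_1, \ldots, \mathfrak{z}_{k-1}$, set $\mathfrak{z}_{k+1} = \ldots = \mathfrak{z}_{K-1} = \infty$. Any particular value of $\mathfrak{z}_k$
would lead to the set of parameters $\mathfrak{z}_1, \ldots, \mathfrak{z}_k, \infty, \ldots, \infty$ and the family of estimators
$\hat\theta_m(\mathfrak{z}_1, \ldots, \mathfrak{z}_k, x)$ for $m = k + 1, \ldots, K$. Select the smallest $\mathfrak{z}_k$ ensuring

$\frac{1}{\mathfrak{R}_r} E_{\theta^*} L^r(W^{(m)}, \tilde\theta_m(x), \hat\theta_m(\mathfrak{z}_1, \mathfrak{z}_2, \ldots, \mathfrak{z}_k, x)) \le \frac{k \alpha}{K - 1}$ (13)

for all $m = k + 1, \ldots, K$.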
A few remarks on the proposed algorithm.

1. A value $\mathfrak{z}_1$ ensuring (12) always exists because the choice $\mathfrak{z}_1 = \infty$ yields
$\hat\theta_k(\mathfrak{z}_1, x) = \tilde\theta_k(x)$ for all $k \ge 2$.

2. The value $L^r(W^{(m)}, \tilde\theta_m(x), \hat\theta_m(\mathfrak{z}_1, \mathfrak{z}_2, \ldots, \mathfrak{z}_k, x))$ from (13) only accumulates the
losses associated with the false alarms at the first $k$ steps of the procedure. The
other checks at further steps are always accepted because the corresponding critical
values $\mathfrak{z}_{k+1}, \ldots, \mathfrak{z}_{K-1}$ are set to infinity.

3. The accumulated risk bound grows at each step by $\alpha/(K - 1)$. This value can
be seen as the maximal risk associated with the critical values $\mathfrak{z}_1, \mathfrak{z}_2, \ldots, \mathfrak{z}_k$.

4. The value $\mathfrak{z}_k$ ensuring (13) always exists, because the choice $\mathfrak{z}_k = \infty$ yields

$\hat\theta_m(\mathfrak{z}_1, \mathfrak{z}_2, \ldots, \mathfrak{z}_k, x) = \hat\theta_m(\mathfrak{z}_1, \mathfrak{z}_2, \ldots, \mathfrak{z}_{k-1}, x)$

for all $m > k$.

5. All the computed values depend on the considered linear parametric model, the
sequence of bandwidths $h_k$ and the quantile level $\tau$. They also depend on the local
point $x$ via the basis vectors $\Psi_i$. However, under common regularity conditions
on the design $X_1, \ldots, X_n$, the dependency on $x$ is rather minor. Therefore, the
adaptive estimation procedure can be repeated at different points without reiterating
the steps of selecting the critical values.
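
A Monte Carlo version of this sequential calibration can be sketched in Python for the local
constant case, with the accumulated risk bound growing by $\alpha/(K - 1)$ per step as in
remark 3. This is an illustration under stated assumptions, not the code behind the tables
below: the Epanechnikov kernel, the equispaced design, the candidate grid for the critical
values and the choice of $\mathfrak{R}_r$ as the largest simulated parametric risk are all
assumptions, and every function name is hypothetical.

```python
import math
import random

def kernel(t):
    """Epanechnikov kernel (an assumed choice of K_loc)."""
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0

def check_loss(u, tau):
    return u * (tau - (1.0 if u < 0 else 0.0))

def weighted_quantile(ys, ws, tau):
    """Local constant qMLE: minimizer of sum_i w_i rho_tau(Y_i - theta)."""
    pairs = sorted((y, w) for y, w in zip(ys, ws) if w > 0.0)
    total, acc = sum(w for _, w in pairs), 0.0
    for y, w in pairs:
        acc += w
        if acc >= tau * total:
            return y
    return pairs[-1][0]

def gap(best, other, ys, ws, tau):
    """L(W, best) - L(W, other) where best maximizes L(W, .); clipped at 0
    to guard against rounding, since the exact value is non-negative."""
    g = sum(w * (check_loss(y - other, tau) - check_loss(y - best, tau))
            for y, w in zip(ys, ws))
    return max(g, 0.0)

def sample_ald(tau, rng):
    """Draw from the density tau (1 - tau) exp{-rho_tau(u)}."""
    if rng.random() < tau:
        return -rng.expovariate(1.0 - tau)
    return rng.expovariate(tau)

def simulate(xs, x, bandwidths, tau, n_mc, rng):
    """For each Monte Carlo sample under theta* = 0 return (D, R) with
    D[a][b] = L(W^(a), tilde_a) - L(W^(a), tilde_b) and
    R[a]    = L(W^(a), tilde_a) - L(W^(a), theta*)."""
    all_ws = [[kernel((xi - x) / h) for xi in xs] for h in bandwidths]
    out = []
    for _ in range(n_mc):
        ys = [sample_ald(tau, rng) for _ in xs]
        tilde = [weighted_quantile(ys, ws, tau) for ws in all_ws]
        D = [[gap(ta, tb, ys, ws, tau) for tb in tilde]
             for ta, ws in zip(tilde, all_ws)]
        R = [gap(ta, 0.0, ys, ws, tau) for ta, ws in zip(tilde, all_ws)]
        out.append((D, R))
    return out

def selected(D, z):
    """Index of the adaptive estimator after each step, following rule (10)."""
    sel, accepting = [0], True
    for k in range(1, len(D)):
        accepting = accepting and all(D[l][k] <= z[l] for l in range(k))
        sel.append(k if accepting else sel[-1])
    return sel

def risks(samples, z, r):
    """Monte Carlo estimate of E L^r(W^(m), tilde_m, hat_m) for every m."""
    K = len(samples[0][0])
    totals = [0.0] * K
    for D, _ in samples:
        sel = selected(D, z)
        for m in range(K):
            totals[m] += D[m][sel[m]] ** r
    return [t / len(samples) for t in totals]

def calibrate(samples, alpha, r, candidates):
    """Sequentially pick the smallest candidate critical value meeting the bound."""
    K = len(samples[0][0])
    # Assumption: R_r is taken as the largest Monte Carlo parametric risk.
    bound = max(sum(R[k] ** r for _, R in samples) / len(samples)
                for k in range(K))
    z = [math.inf] * (K - 1)
    for s in range(K - 1):
        limit = (s + 1) * alpha * bound / (K - 1)
        for c in sorted(candidates):
            z[s] = c
            rk = risks(samples, z, r)
            if all(rk[m] <= limit for m in range(s + 1, K)):
                break
        else:
            z[s] = math.inf        # no candidate works; infinity always does
    return z, bound

rng = random.Random(0)
xs = [i / 99 for i in range(100)]
bandwidths = [0.05, 0.1, 0.2, 0.4, 0.8]
samples = simulate(xs, 0.5, bandwidths, tau=0.75, n_mc=200, rng=rng)
z, bound = calibrate(samples, alpha=0.25, r=0.5,
                     candidates=[0.5 * j for j in range(1, 41)])
```

Precomputing the matrix of log-likelihood gaps once per simulated sample means that trying a
new candidate critical value only re-runs the cheap acceptance rule, not the estimation. With
only 200 samples and a coarse candidate grid the resulting values are rough; a finer grid and
more samples would be needed for anything beyond illustration.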

3 Simulations 

First, we check the critical values at different quantile levels $\tau = 0.05, 0.5, 0.75, 0.95$ and
for different noise distributions: a) ALD, b) normal and c) Student $t(3)$. We also study
how misidentification of the noise distribution affects the critical values.

Second, we compare the performance of our local bandwidth algorithm with two other 
bandwidth selection techniques. One proposal is from Yu and Jones (1998), in which 
they consider a rule of thumb bandwidth based on the assumption that the quantiles 
are parallel, and another comes from Cai and Xu (2008), where an approach based on a 
nonparametric version of the Akaike information criterion (AIC) is implemented. 

3.1 Critical Values 

Here we summarize our numerical results on choosing the critical values by the propagation
condition as described in Section 2.5. We only consider local constant modeling with $p = 0$
and local linear modeling with $p = 1$, starting with $p = 0$.

Table 1 shows the critical values for several choices of $\alpha$ and $r$ with $\tau = 0.75$, $m =
10000$ Monte Carlo samples, and a bandwidth sequence $(8, 14, 19, 25, 30, 36, 41, 52) * 0.001$
scaled to the interval $[0, 1]$. Critical values decrease when $\alpha$ increases, and increase when
$r$ increases.

Table 1: Critical Values with Different $r$ and $\alpha$

$\alpha = 0.25$, $r = 0.5$     6.123   2.333   0.987   3.678e-05   0.000
$\alpha = 0.5$, $r = 0.5$      4.616   1.578   0.357   2.472e-05   0.000
$\alpha = 0.6$, $r = 0.5$      3.203   0.679   0.025   0.006       7.278e-05
$\alpha = 0.25$, $r = 0.75$    9.127   3.288   1.031   0.126       5.675e-05
$\alpha = 0.25$, $r = 1$       12.75   4.280   1.224   1.095e-04   0.000

Table 2 displays critical values for different $\tau$, with $\alpha = 0.25$, $r = 0.5$, $m = 10000$
Monte Carlo samples, a bandwidth sequence $\mathfrak{h}_1 = (8, 14, 19, 25, 30, 36, 41, 52) * 0.001$, and
$N(0, 1)$ noise. Critical values behave similarly for different $\tau$.

Table 2: Critical Values with Different $\tau$

$\tau = 0.05$    6.464   2.204   0.620   3.345e-05   0.000
$\tau = 0.5$     7.997   3.089   0.986   0.300e-05   0.000
$\tau = 0.75$    9.203   3.910   1.106   0.123       7.254e-05
$\tau = 0.95$    8.589   5.452   1.904   0.334       1.203e-05

Table 3 displays the critical values for three alternative bandwidth sequences:

$\mathfrak{h}_1 = (8, 14, 19, 25, 30, 36, 41, 52) * 0.001$
$\mathfrak{h}_2 = (8, 16, 25, 36, 49, 63, 79, 99) * 0.001$
$\mathfrak{h}_3 = (5, 8, 14, 19, 27, 36, 46, 58) * 0.001$

with $\alpha = 0.25$, $r = 0.5$, and $\tau = 0.85$. Although the critical values differ for different
bandwidth sequences, $\alpha$, $r$ and $\tau$, they indicate the same patterns (finite and decreasing).

Table 3: Critical Values with Different Bandwidth Sequences

11.33   1.243   6.933e-05   0.000       0.000
18.39   6.479   2.230       0.469       8.738e-05
6.123   2.333   0.987       3.678e-05   0.000

We simulate from different data generating processes; that is, the distribution of $\varepsilon_i$
(given by the density $\ell(\cdot)$) does not necessarily coincide with the likelihood (ALD$_\tau$)
taken to simulate critical values. Table 4 presents critical values simulated under $t(3)$,
$N(0, 1)$ and ALD$_\tau$. The critical values show the same trend with some differences, so we
conclude that a misidentification of the error distribution would not significantly contaminate
the confidence sets.



Table 4: Critical Values with Different Noise Distributions

$N(0, 1)$      11.50   4.924   2.514   1.313   2.765e-05
ALD$_\tau$     14.05   6.554   3.304   1.443   5.879e-05
$t(3)$         15.42   8.707   2.370   0.342   3.898e-05



In Table 5, critical values are shown under the same circumstances as in Table 4 for the
local linear case. Since one more variable (trend) is introduced, the critical values are doubled
or tripled compared to the local constant case. The behavior with respect to tail functions
stays the same.

Table 5: Critical Values with Different Noise Distributions in Local Linear Case

$N(0, 1)$    29.97   58.64   43.21   33.41   19.43   7.40
ALD(0.5)     45.28   74.51   66.43   50.42   31.42   13.50
$t(3)$       51.77   84.94   59.28   44.99   29.07   11.57



3.2 Comparison of Different Bandwidth Selection Techniques 

We illustrate our proposal by considering $x \in [0, 1]$, $\tau = 0.75$. Samples of size $n = 1000$
are simulated under three scenarios:

$f^{[1]}(x) = \begin{cases} 0 & \text{if } x \in [0, 0.333]; \\ 8 & \text{if } x \in (0.333, 0.666]; \\ -1 & \text{if } x \in (0.666, 1], \end{cases}$

$f^{[2]}(x) = 2x(1 + x),$

$f^{[3]}(x) = \sin(k_1 x) + \cos(k_2 x)\, \mathbb{1}\{x \in (0.333, 0.666)\} + \sin(k_2 x).$

The noise distributions are: $N(0, 0.03)$, ALD$_\tau$, $t(3)$.
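
The first two scenarios can be simulated with a few lines of Python. The sketch rests on
several assumptions that the text does not settle: an equispaced design on $[0, 1]$, reading
0.03 as the standard deviation of the Gaussian noise, and hypothetical function names. The
third scenario is left out because the constants $k_1$ and $k_2$ are not given here.

```python
import random

def f1(x):
    """Scenario 1: piecewise constant curve with two jumps."""
    if x <= 0.333:
        return 0.0
    if x <= 0.666:
        return 8.0
    return -1.0

def f2(x):
    """Scenario 2: smooth curve 2x(1 + x)."""
    return 2.0 * x * (1.0 + x)

def gaussian_noise(rng, sd=0.03):
    # Assumption: 0.03 is read as a standard deviation.
    return rng.gauss(0.0, sd)

def ald_noise(rng, tau=0.75):
    # Draw from the density tau (1 - tau) exp{-rho_tau(u)}.
    if rng.random() < tau:
        return -rng.expovariate(1.0 - tau)
    return rng.expovariate(tau)

def simulate(f, n, noise, rng):
    """Equispaced design on [0, 1] (assumed) and responses Y_i = f(X_i) + eps_i."""
    xs = [i / (n - 1) for i in range(n)]
    ys = [f(x) + noise(rng) for x in xs]
    return xs, ys

rng = random.Random(1)
xs, ys = simulate(f1, 1000, gaussian_noise, rng)
xs2, ys2 = simulate(f2, 1000, ald_noise, rng)
```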

Figure 3 compares the different estimators in the local constant
case. Figures 4 and 5 show, in the local linear case, the estimators of the function $f(x)$
and of its first derivative. Our technique provides closer fits to the true curve $f(x)$
than methods with a global fixed bandwidth, especially in the presence of jumps. Table
6, which shows the averaged absolute errors for the four methods, further confirms our
conclusion.

Table 6: Comparison of Monte Carlo errors, averaged over 1000 samples

               Fixed bandw   Local constant   Local linear   Fixed bandw (Cai)
$f^{[1]}(x)$   0.654         0.172            0.169          0.378
$f^{[2]}(x)$   0.206         0.008            0.008          0.245
$f^{[3]}(x)$   0.137         0.021            0.019          0.123

Table 7 offers a further analysis for misspecified error distributions. Specifically, it
evaluates the accuracy of our estimation when the errors are generated from a distribution
other than the ALD density. Table 7 gives $L_1$ errors between $\hat f(\cdot)$ (with critical values
simulated from ALD$_\tau$) and $f(\cdot)$, from which we conclude that misspecification of the error
distribution would not contaminate our results significantly.

Table 7: Comparison of error misspecification; errors are averaged over 1000 samples

               Local constant { N(0, 1) }   Local constant { t(3) }   Local linear { N(0, 1) }
$f^{[1]}(x)$   0.252                        0.220                     0.169
$f^{[2]}(x)$   0.070                        0.016                     0.043
$f^{[3]}(x)$   0.009                        0.021                     0.019



4 Applications 

In the study of financial products, it is very important to detect and understand tail 
dependence among underlyings such as stocks. In particular, the tail dependence structure 
represents the degree of dependence in the corner of the lower-left quadrant or upper- 
right quadrant of a bivariate distribution. Hauksson et al. (2001) and Embrechts and 
Straumann (1999) provide good access to the literature on tail dependence and Value 
at Risk. With the adaptive quantile technique, we provide an alternative approach to 
studying tail dependence. 

The correlation is calibrated from real data as given in Figure 6, where $X$ is the
standardized return of the stock "clpholdings" from the Hong Kong Hangseng Index, and $Y$ is
the return of the stock "cheung kong". Under normality the conditional quantile function is
linear; for example, for $X_1 \sim N(\mu_1, \sigma_1)$ and $X_2 \sim N(\mu_2, \sigma_2)$, the conditional
quantile function at level $\alpha$ is:

$f(x) = \varphi^{-1}(\alpha)(\sigma_2 - \sigma_{12}/\sigma_1) + \mu_1 + \sigma_{12} \sigma_2^{-2}(x - \mu_2).$

Figure 6 shows that the empirical conditional quantile curves actually deviate from the one
calculated from normal distributions, which implies non-normality. The motivation for
adaptive bandwidth selection is clear from Figure 6: the change in the dependency structure
is more obvious than with the fixed bandwidth curve. Moreover, the flexible
adaptive curve is not likely to be a consequence of overfitting since it mostly lies within the
confidence bands produced by fixed bandwidth estimation, see Härdle and Song (2010).

(d) ALD(0.5)

Figure 3: The bandwidth sequence (upper left panel), the smoothed bandwidth (magenta
dashed); the data with noise (grey, lower left panel), the adaptive estimation of 0.75
quantile (dashed black), the quantile smoother with fixed optimal bandwidth = 0.06
(solid black), the estimation with smoothed bandwidth (dashed magenta); boxplot of
block residuals fixed bandwidth (upper right), adaptive bandwidth (lower right)

(a) (b)

Figure 4: The bandwidth sequence (upper left panel), the smoothed bandwidth sequence
(dashed magenta); the observations (grey, lower left panel), the adaptive estimation of
0.75 quantile (dotted black), the true curve (solid black), the quantile smoother with
fixed optimal bandwidth = 0.063 (dashed dotted blue), the estimation with adaptively
smoothed bandwidth (dashed magenta); the blocked error of the adaptive estimator (lower
right); the fixed estimator (upper right).

(a) (b)

Figure 5: The adaptive estimation of first derivative of the above quantile function (left
panel grey), the true curve (solid black), the estimation with smoothed bandwidth (dashed
black), the quantile smoother with fixed optimal bandwidth = 0.045 (dotted black); the
blocked error of the adaptive estimator (lower right); the fixed estimator (upper right).

Figure 6: The bandwidth sequence with smoothed bandwidth curve (upper left panel), the
smoothed bandwidth (dashed magenta); Scatter plot of stock returns (upper right panel),
the adaptive estimation of 0.90 quantile (solid magenta), the quantile smoother with fixed
optimal bandwidth = 0.15 (dotted black); fixed bandwidth curve (dotted black), adaptive
bandwidth curve (grey), the estimation with smoothed bandwidth (dashed magenta),
confidence band (dashed black) (lower left panel); adaptive bandwidth with normal scale
(lower right).

We measure the deviation from normality by the accumulated $L_1$ distance to the normal 
fit and examine different combinations of stocks from the Hong Kong Hang Seng Index. The 
results are summarized in Table 8. 



Table 8: Summary of deviation from normality 

                   Chalco   Cosco pacific   Bank of China 
New world devo     0.252    0.220           0.169 
Sino land          0.070    0.016           0.043 
Swire pacific A    0.009    0.021           0.019 

Another application of quantile function estimation is in temperature data analysis, 
which is of key interest for pricing temperature derivatives. Quantile regression provides 
a flexible and comprehensive approach to understanding the temperature risk drivers 
defined in (14). 

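Denote the temperature on day $t$ of year $j$ by $T_{t,j}$, with $t = 1, \ldots, \tau = 365$ days and 
$j = 0, \ldots, J$ years. The time series decomposition for $T_{t,j}$ is given as: 

\[
\begin{aligned}
X_{365j+t} &= T_{t,j} - \Lambda_t, \\
X_{365j+t} &= \sum_{l=1}^{L} \beta_l X_{365j+t-l} + \varepsilon_{t,j}, \\
\varepsilon_{t,j} &= \sigma_t \eta_{t,j}, \qquad \eta_{t,j} \sim N(0,1), \\
\hat\varepsilon_{t,j} &\overset{\text{def}}{=} X_{365j+t} - \sum_{l=1}^{L} \hat\beta_l X_{365j+t-l},
\end{aligned} \tag{14}
\]

where $T_{t,j}$ is the temperature at day $t$ in year $j$, $\Lambda_t$ denotes the seasonality effect and 
$\sigma_t$ the seasonal volatility. 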
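
The residual construction in (14) can be illustrated with a minimal sketch. It is not the authors' implementation: the seasonality $\Lambda_t$ is approximated here by the plain day-of-year mean, the autoregressive coefficients are fitted by ordinary least squares, and all function names (`deseasonalize`, `fit_ar`, `ar_residuals`, `simulate_ar1`) are hypothetical.

```python
import random


def deseasonalize(temps, period=365):
    """Subtract the day-of-year mean (an assumed, simple stand-in for Lambda_t)."""
    sums, counts = [0.0] * period, [0] * period
    for s, value in enumerate(temps):
        sums[s % period] += value
        counts[s % period] += 1
    means = [sums[t] / counts[t] if counts[t] else 0.0 for t in range(period)]
    return [value - means[s % period] for s, value in enumerate(temps)]


def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ar(x, lags):
    """Least-squares estimate of beta_1..beta_L in x_s = sum_l beta_l x_{s-l} + eps_s."""
    rows = [[x[s - l] for l in range(1, lags + 1)] for s in range(lags, len(x))]
    ys = x[lags:]
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(lags)] for i in range(lags)]
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(lags)]
    return solve_linear_system(gram, rhs)


def ar_residuals(x, beta):
    """Residuals eps_hat_s = x_s - sum_l beta_l x_{s-l}, defined for s >= L."""
    lags = len(beta)
    return [x[s] - sum(beta[l - 1] * x[s - l] for l in range(1, lags + 1))
            for s in range(lags, len(x))]


def simulate_ar1(beta, n, seed):
    """Synthetic AR(1) series with standard normal innovations (illustration only)."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n):
        x.append(beta * x[-1] + rng.gauss(0.0, 1.0))
    return x
```

On real data one would call `deseasonalize` on the stacked daily temperatures first, then `fit_ar` and `ar_residuals` on the result; the residuals play the role of $\hat\varepsilon_{t,j}$.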

We are specifically interested in the stochastic risk drivers $\hat\varepsilon_{t,j}$. Figure 7 presents a time 
series plot of $\hat\varepsilon_{t,j}/\hat\sigma_t$ and the estimated 90% quantile function. Zooming in on the curve, 
we observe a very interesting phenomenon: a change in the trend of the standardized 
residuals over the years. 

To further understand the risk factors, we analyze the quantile functions of $\hat\varepsilon_{t,j}^2$ over 
12 years, averaged over 4-year periods for comparison; see Figure 8 and Figure 9. The 
differences between Berlin and Kaoshiung are easy to see: for Kaoshiung the variance 
function is high in January and February, while for Berlin the peaks tend to come in summer. 
Moreover, there is a tendency for Kaoshiung to become more volatile over time, but this 
phenomenon does not appear in Berlin. 

In addition, our technique can also be used to estimate the volatility function $\sigma_t$. We propose 
four methods: (1) estimate the median curve of $\hat\varepsilon_{t,j}^2$ using the adaptive technique; (2) take 
$\{\hat f_{\varepsilon,0.75} - \hat f_{\varepsilon,0.25}\}/1.34$ (1.34 is the interquartile range of a standard normal distribution), 
where $\hat f_{\varepsilon,0.75}$, $\hat f_{\varepsilon,0.25}$ are the adaptive quantile estimators; (3) estimate the mean curve 
of $\hat\varepsilon_{t,j}^2$ using an adaptive bandwidth; (4) estimate the mean function of $\hat\varepsilon_{t,j}^2$ with a fixed 
bandwidth. The methods are compared by testing the normality of $\hat\eta_{t,j} = \hat\varepsilon_{t,j}/\hat\sigma_t$: 
under our normality assumption on $\eta_{t,j}$, a good estimate of $\sigma_t$ leads 
to normal standardized residuals $\hat\eta_{t,j}$. Tables 9 and 10 summarize the p-values of the 
normality tests of the standardized residuals from the four methods for Berlin and Kaoshiung. It 
can be seen that Berlin has more normal residuals than Kaoshiung. Method three is always 

Figure 7: Plot of the quantile curve for standardized weather residuals over 40 years in Berlin, 
95% quantile, 1967-2006. Selected bandwidths (upper), observations with the estimated 
quantile function (middle), the estimated quantile function (lower). 

Figure 8: Estimated 90% quantile of variance functions, Berlin, averaged over 1995-1998, 
1999-2002 (red), 2003-2006 (green) 

Figure 9: Estimated 90% quantile of variance functions, Kaoshiung, averaged over 1995-1998, 
1999-2002 (red), 2003-2006 (green) 

better at producing normal residuals, and method two is comparable to method three. 
This may be due to the fact that quantiles at higher or lower levels are better at explaining 
the extremes of the volatility function. Method four performs worse, since it uses a 
fixed bandwidth. We therefore conclude that our adaptive technique is useful in modeling 
temperature residuals. 
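
The scaling behind method two can be sketched on a plain sample. This is only an illustration under simplifying assumptions: it uses global empirical quartiles instead of the paper's local adaptive quantile estimators, and the names `NORMAL_IQR` and `iqr_scale` are hypothetical. The constant is computed from the standard normal quantile function and is approximately 1.349, which the text rounds to 1.34.

```python
import random
from statistics import NormalDist, quantiles

# Interquartile range of N(0, 1): z_{0.75} - z_{0.25}.
NORMAL_IQR = NormalDist().inv_cdf(0.75) - NormalDist().inv_cdf(0.25)


def iqr_scale(sample):
    """Scale estimate (q_0.75 - q_0.25) / IQR_normal from empirical quartiles."""
    q1, _median, q3 = quantiles(sample, n=4)
    return (q3 - q1) / NORMAL_IQR


def demo_sample(sigma, n, seed):
    """Synthetic N(0, sigma^2) residuals, for illustration only."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]
```

For normal residuals the estimate is close to the true standard deviation; for heavy-tailed residuals it is less sensitive to extremes than a moment-based estimate.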

Table 9: P-values of normality tests: Berlin 

Method    AD       JB       KS 
1         0.000    0.010    0.060 
2         0.062    0.000    0.020 
3         0.054    0.487    0.171 
4         0.009    0.000    0.002 



Table 10: P-values of normality tests: Kaoshiung 

Method    AD          JB       KS 
1         0.000       0.000    0.000 
2         1.03e-05    0.077    0.043 
3         2.37e-06    0.742    0.674 
4         0.000       0.021    0.019 

5 Finite Sample Theory 

This section discusses some theoretical properties of the proposed estimator $\hat\theta(x) = \tilde\theta_{\hat k}(x)$ 
under a general data distribution. Here $\hat k = \hat k(x)$ is the index selected by the pointwise 
procedure from Section 2.4. The main "oracle" result shows that $\hat\theta(x)$ is adaptive in the 
sense that it provides nearly the same quality of estimation as the oracle estimator $\tilde\theta_{k^*}(x)$, 
which is the best in the family $\{\tilde\theta_k(x)\}_{k=1}^{K}$. A precise definition of $k^*$ will be given below 
in terms of the modeling bias. 

5.1 Modeling Bias 

The proposed approach for the bandwidth selection suggests taking a larger and larger 
bandwidth as long as the linear parametric assumption is not significantly violated on the 
considered interval. The likelihood ratio test statistics $L(W^{(\ell)}, \tilde\theta_\ell(x), \tilde\theta_k(x))$ from (10) 
are used for this check. The formal definition of the best or oracle choice requires 
a measure for the deviation of the function $f(\cdot)$ from its best linear 
approximation of the form $\Psi_i^\top\theta$ on the interval of radius $h_k$ considered at step $k$ of 
the procedure. We follow Spokoiny (2009), who introduced the modeling bias to measure 
the deviation from the linear parametric structure. Define $P_i$ as the distribution of the 
observation $Y_i$. Let also $P_{i,s}$ be the shift of $P_i$ by $s$, that is, the distribution of $Y_i - s$. 
Also denote $f_i = f(X_i)$ and $f_i(\theta) = \Psi_i^\top\theta$. In particular, $P_{i,f_i}$ is the distribution of 
$\varepsilon_i \overset{\text{def}}{=} Y_i - f(X_i)$, so that its $\tau$-quantile is zero. The underlying measure $P$ is the product 
of the measures $P_{i,f_i}$. Under the linear parametric assumption (PA) $f(X_i) = f_\theta(X_i)$, the 
corresponding measure $P_\theta$ is the product of the $P_{i,f_i(\theta)}$: 

\[ P = \prod_{i=1}^{n} P_{i,f_i}, \qquad P_\theta = \prod_{i=1}^{n} P_{i,f_i(\theta)}. \]

The modeling bias at step $k$ measures the deviation of the true quantile function $f$ from 
the linear parametric one and it is defined as 

\[ \Delta_k \overset{\text{def}}{=} \inf_{\theta} \Delta_k(\theta), \qquad \Delta_k(\theta) \overset{\text{def}}{=} \sum_{i=1}^{n} \mathcal{K}(P_{i,f_i}, P_{i,f_i(\theta)}) \, \mathbb{1}(w_i^{(k)} > 0). \]

Here $\mathcal{K}(P, Q)$ is the Kullback-Leibler divergence between two measures $P$ and $Q$. The 
quantity $\Delta_k(\theta)$ can be viewed as a weighted Kullback-Leibler divergence between $P$ and 
$P_\theta$ localized to the observations in the interval of radius $h_k$ around $x$. The value $\Delta_k$ 
describes the quality of the best linear approximation on this interval. The small modeling 
bias (SMB) condition states that the value $\Delta_k$ does not exceed a prescribed quantity 
$\Delta > 0$, and the oracle choice $k^*$ of the bandwidth index is defined as the largest index 
$k$ for which the SMB condition is satisfied: 

\[ k^* = \operatorname*{argmax}_{k \le K} \{\Delta_k \le \Delta\}. \tag{15} \]


Spokoiny (2009) argued that such a choice leads to the bias-variance trade-off in the 
usual nonparametric sense. Thus, the oracle bandwidth yields the rate optimal estimation 
quality in the asymptotic set-up. 
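
As a small illustration of the oracle choice, the sketch below picks the largest index whose modeling bias stays within the prescribed level. It assumes the modeling biases are already available as a list, which is never the case in practice since they depend on the unknown $f$; the function name `oracle_index` is hypothetical, and the selection rule is a literal reading of the SMB condition.

```python
def oracle_index(modeling_bias, delta):
    """Return the largest 1-based index k with modeling_bias[k-1] <= delta.

    Returns None when no index satisfies the small modeling bias condition.
    """
    best = None
    for k, bias in enumerate(modeling_bias, start=1):
        if bias <= delta:
            best = k
    return best
```

For example, with biases `[0.01, 0.05, 0.3, 2.0]` for four increasing bandwidths and level `0.5`, the third bandwidth is the oracle choice: a larger window would admit too much bias, a smaller one would waste observations.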

All the introduced quantities depend on the central point $x$. Therefore, the parameter 
$\theta^*$ of the best parametric fit and the oracle index $k^*$ also depend on $x$: our approach 
allows us to specify the best bandwidth for each point separately. Under the measure $P_{\theta^*}$, 
the estimate $\tilde\theta_k(x)$ is close to $\theta^*$ in the sense that the confidence set $\mathcal{E}_k(\mathfrak{z}_k)$ covers $\theta^*$ 
with a high probability and the risk $E_{\theta^*} L^r(W^{(k)}, \tilde\theta_k(x), \theta^*)$ remains bounded by a fixed 
constant $\mathfrak{R}_r$ for all $k \le K$. The definition of the modeling bias based on the Kullback-Leibler 
divergence allows for the translation of these properties to the general case at the 
cost of the additional factor $e^{\Delta}$. More precisely, the following bound holds. 



Theorem 5.1. Let $\theta^*$ and $k^* \le K$ be such that $\Delta_{k^*}(\theta^*) \le \Delta$. Then for each $k \le k^*$ 



So, if A is small all the confidence or risk bounds continue to apply even in the local 
nonparametric situation. 

5.2 "Oracle" Property 

This section presents our main result, the oracle risk bound. Its main message 
is that the adaptive estimator $\hat\theta(x)$ performs nearly as well as the best 
(oracle) estimator. Our theoretical study is performed under the assumption that the 
critical values $\mathfrak{z}_k$ are computed under the measure $P_{\theta^*}$ described in Section 5.1. Due 
to Lemma 1, the particular choice of the parameter $\theta^*$ does not matter. In addition, $P_{\theta^*}$ 
involves the distribution of the residuals $Y_i - f(X_i)$, which is not available. However, one 
can use a proxy for this distribution, because the critical values are rather stable with respect to 
the error distribution; see Corollary 1 and the discussion afterwards for more arguments. 

Let the bandwidth index $k^*$ be defined by the SMB condition (15), that is, 
$k^* = \operatorname{argmax}_{k \le K} \{\Delta_k \le \Delta\}$, leading to the oracle estimator $\tilde\theta_{k^*}(x)$. The next result 
claims that for the final estimator $\hat\theta(x)$, the difference 

\[ L(W^{(k^*)}, \tilde\theta_{k^*}(x), \hat\theta(x)) = L(W^{(k^*)}, \tilde\theta_{k^*}(x)) - L(W^{(k^*)}, \hat\theta(x)) \]

is not larger in order than $\mathfrak{z}_{k^*}$. Later we show that the critical value $\mathfrak{z}_{k^*}$ is at most 
logarithmic in the sample size $n$. Therefore, the oracle result means that the adaptive 
estimator $\hat\theta(x)$ belongs with a dominating probability to a confidence set of the oracle. 

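Theorem 5.2. Suppose A.1-A.5. Let $k^* \le K$ be such that $\Delta_{k^*}(\theta) \le \Delta$. Then 

\[ E \log\Bigl(1 + \frac{L^r(W^{(k^*)}, \tilde\theta_{k^*}(x), \hat\theta(x))}{\mathfrak{R}_r}\Bigr) \le \alpha + \Delta + \log\Bigl(1 + \frac{\mathfrak{z}_{k^*}^r}{\mathfrak{R}_r}\Bigr). \]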

An interesting special case of this result is the pure parametric situation with a linear 
(in $\Psi$) quantile function $f$. The oracle estimator $\tilde\theta_{k^*}$ corresponds to $k^* = K$, that is, 
to the largest bandwidth $h_K$. If it is large enough, then $\tilde\theta_K$ nearly coincides with the 
global quantile estimator. Moreover, if the errors $Y_i - f(X_i)$ are i.i.d. Laplacian, then 
this estimator is nearly efficient. The critical values $\mathfrak{z}_k$ decrease with $k$, and the value 
$\mathfrak{z}_K$ for the largest index is usually close to zero. So our oracle result yields that the 
proposed adaptive procedure is nearly efficient in the parametric situation. 

6 Appendix 

The appendix collects the conditions, technical results, and the proofs. First we fix our 
assumptions. We assume independent observations $Y_1, \ldots, Y_n$. The results are stated 
for a deterministic design $X_1, \ldots, X_n$ under mild regularity conditions. The case of a 
random design can be handled by the usual conditioning argument. Given $\tau$, the 
quantile function $f(\cdot)$ is defined by the relation $P\{Y_i > f(X_i)\} = \tau$. To avoid ambiguous 
notation, we suppose that this equation has a unique solution for each $i$. The general 
case can be easily reduced to this one by standard arguments; see e.g. Koenker (2005). 
We also denote by $P_i$ the distribution of the residual $\varepsilon_i = Y_i - f(X_i)$ and by $\ell_i(\cdot)$ its 
density. Below a point $x$ is fixed and the target of estimation is the quantile $f(x)$. The 
local parametric approach requires fixing a localizing weighting scheme $W = (w_1, \ldots, w_n)$ 
and a linear parametric family $f_\theta(\cdot)$ with $f_\theta(X_i) = \Psi_i^\top\theta$, where $\Psi_{i,m} = (X_i - x)^m/m!$ for 
$m = 0, 1, \ldots, p$. 

Our theoretical study can be separated into two parts. The essential and most 
difficult part is done under the linear parametric assumption $f(\cdot) = f_{\theta^*}(\cdot)$; then we 
extend the results to the case when this assumption is approximately fulfilled in a local 
vicinity of the central point $x$. 

Below, a family of localizing weighting schemes $W^{(k)} = \{w_i^{(k)}\}_{i=1}^{n}$ for $k = 1, \ldots, K$ 
is supposed to be fixed. Our standard proposal is $w_i^{(k)} = K_{\mathrm{loc}}\{(X_i - x)/h_k\}$ for a given 
kernel $K_{\mathrm{loc}}(\cdot)$ and a sequence of bandwidths $h_1 < h_2 < \ldots < h_K$. Define 

\[ D_k^2 = \sum_{i=1}^{n} \ell_i(0) \Psi_i \Psi_i^\top w_i^{(k)}, \tag{16} \]

\[ V_k^2 \overset{\text{def}}{=} \operatorname{Var}\{\nabla L(W^{(k)}, \theta^*)\} = \tau(1 - \tau) \sum_{i=1}^{n} \Psi_i \Psi_i^\top |w_i^{(k)}|^2, \tag{17} \]

\[ N_k^{-1/2} \overset{\text{def}}{=} \max_{i} \sup_{\gamma \in \mathbb{R}^{p+1}} \frac{|\gamma^\top \Psi_i| \, \mathbb{1}(w_i^{(k)} > 0)}{\|V_k \gamma\|} \sqrt{\tau(1 - \tau)}. \tag{18} \]

Here $D_k^2$ and $V_k^2$ are symmetric $(p + 1) \times (p + 1)$ matrices: $D_k^2$ can be defined similarly 
to the Fisher information matrix, $D_k^2 = -\nabla^2 E L(W^{(k)}, \theta^*)$, while $V_k^2$ is the covariance 
matrix of the score $\nabla L(W^{(k)}, \theta^*)$ under the parametric assumption $f = f_{\theta^*}$. In the 
global parametric situation, these two matrices coincide. The value $N_k$ can be treated as 
the local sample size corresponding to the localizing scheme $W^{(k)}$. 
The following conditions will be assumed for our results. 

A.1 $\{Y_i\}_{i=1}^{n}$ are independent. 

A.2 For some constants $0 < u_0 < u < 1$, 

\[ u_0 \le \|D_k^{-1} D_{k-1}^2 D_k^{-1}\|_\infty \le u < 1. \]

A.3 For a constant $\mathfrak{a} > 0$ and all $k = 1, \ldots, K$, it holds 

\[ V_k^2 \le \mathfrak{a}^2 D_k^2. \]

A.4 For some fixed $\delta < 1/2$ and $\rho > 0$, 

\[ |\ell_i(u)/\ell_i(0) - 1| \le \delta, \qquad |u| \le \rho. \]

A.5 The kernel function $K_{\mathrm{loc}}(\cdot)$ is symmetric, $K_{\mathrm{loc}}(0) = 1$, $K_{\mathrm{loc}}(u)$ decreases in $u \ge 0$ and 
$K_{\mathrm{loc}}(u) = 0$ for $|u| > 1$. 

Condition A.2 effectively requires that the bandwidth sequence $h_k$ grows geometrically 
with $k$. Condition A.3 is the local identifiability condition: it ensures that the 
local variability of the process $L(W^{(k)}, \theta)$ measured by the matrix $V_k$ is not significantly 
larger than the local information measured by the matrix $D_k$. Condition A.4 only requires that 
the density functions $\ell_i(\cdot)$ are uniformly continuous in a vicinity of zero. In particular, 
the residuals need not be identically distributed. All the results below tacitly assume that 
conditions A.1-A.5 hold. 
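
The weighting schemes and the geometric bandwidth grid can be sketched as follows. The quadratic kernel is only one possible choice satisfying A.5, and the starting bandwidth, growth factor and function names are hypothetical values chosen for illustration, not the settings used in the paper.

```python
def kernel(u):
    """Quadratic kernel: symmetric, K(0) = 1, decreasing in |u|, zero for |u| >= 1."""
    return max(0.0, 1.0 - u * u)


def geometric_bandwidths(h1, factor, count):
    """Bandwidths h_1 < h_2 < ... growing geometrically by `factor` > 1."""
    return [h1 * factor ** k for k in range(count)]


def local_weights(design, x, h):
    """Weights w_i = K((X_i - x) / h) for one localizing scheme."""
    return [kernel((xi - x) / h) for xi in design]


def active_points(weights):
    """Number of design points with positive weight."""
    return sum(1 for w in weights if w > 0)
```

With an equispaced design on [0, 1] and `x = 0.5`, the number of points with positive weight grows roughly geometrically with the step index until the window covers the whole design.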

Below we use the generic notation $C = C(A)$ to indicate that a constant $C$ only depends 
on the constants from conditions A.1-A.5 such as $\mathfrak{a}$, $\rho$, $\delta$, $u_0$, $u$, etc. We will also use 
the conditions $(E_r)$, $(L_r)$ etc. defined later in Section 6.2.1. 

6.1 Uniform concentration of the MLEs $\tilde\theta_k(x)$ under $P_{\theta^*}$ 

The first result explains the localization property of the estimators $\tilde\theta_k(x)$ from (9) under 
the linear parametric structure of the quantile function, that is, $f(X_i) = \Psi_i^\top\theta^*$. With 
some value $r_0$ fixed, define for each $k \le K$ a local elliptic set 

\[ \Theta_k(r_0) = \{\theta : \|V_k(\theta - \theta^*)\| \le r_0\} \]

with $V_k^2$ from (17). The question under study is a proper choice of the radius $r_0$ which 
ensures a prescribed small deviation probability for the event $\tilde\theta_k(x) \notin \Theta_k(r_0)$ uniformly 
in $k \le K$. 

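Theorem 6.1. Suppose $(E_r)$ and $(L_r)$. There exist constants $C_1 = C_1(A)$ and 
$C_2 = C_2(A)$ such that the conditions 

\[ r_0^2 \ge C_1(x + p + 1), \qquad \rho^2 N_k \ge C_2(x + p + 1) \tag{19} \]

ensure for $k \le K$ 

\[ P_{\theta^*}\{\tilde\theta_k(x) \notin \Theta_k(r_0)\} \le 2e^{-x}, \]
\[ E_{\theta^*}\bigl[L^r(W^{(k)}, \tilde\theta_k(x), \theta^*) \, \mathbb{1}\{\tilde\theta_k(x) \notin \Theta_k(r_0)\}\bigr] \le C(A) e^{-x}. \]

In particular, the choice $x = \log(K) + x_0$ and then $r_0^2 \ge C_1(x + p + 1)$ ensures a 
dominating probability $1 - 2e^{-x_0}$ for the joint concentration event 

\[ A_1 = \bigcap_{k=1}^{K} \{\tilde\theta_k(x) \in \Theta_k(r_0)\}. \]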

In what follows we suppose that the values $x = \log(K) + x_0$ and $r_0$ are fixed in such a way 
that the probability of the set $A_1$ is sufficiently close to 1. This allows us to restrict 
ourselves to the case when each estimator $\tilde\theta_k(x)$ belongs to the local vicinity $\Theta_k(r_0)$. 
The conditions in (19) require that $r_0^2$ is of the order $\log(K) + (p + 1)$, and the local 
sample size $N_k$ should be at least of the same order. This conclusion is in agreement with 
our numerical simulation results (not reported here). An increase of the polynomial degree 
$p$ requires increasing the smallest bandwidth $h_1$ approximately by the factor $p + 1$ to 
obtain stable behavior of the method. 

6.2 Uniform quadratic approximation of the local excess 

The previous subsection stated that the chance of any of the estimators $\tilde\theta_k(x)$ lying outside 
the neighborhood $\Theta_k(r_0)$ is small. Therefore, in this subsection we focus on the stochastic 
behavior of $\tilde\theta_k$ in $\Theta_k(r_0)$. The proposed estimation procedure is likelihood-based: all 
quantities are defined in terms of the quasi log-likelihood functions $L(W^{(k)}, \theta)$. In particular, 
the properties of the excess $L(W^{(k)}, \tilde\theta_k(x), \theta^*) \overset{\text{def}}{=} L(W^{(k)}, \tilde\theta_k(x)) - L(W^{(k)}, \theta^*)$ 
play a very important role in the whole method. The famous Wilks result claims that twice the 
excess is asymptotically $\chi^2_{p+1}$. Unfortunately, the local parametric approach for a narrow 
local neighborhood of the point $x$ leads to a relatively small effective sample size $N_k$, and 
the asymptotic results cannot be relied upon. The general parametric approach of Spokoiny 
(2011), however, works with finite samples and can be applied directly to a 
local parametric analysis. 
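It holds 

\[
\begin{aligned}
\nabla L(W^{(k)}, \theta^*) &= -\sum_{i=1}^{n} \rho_\tau'(Y_i - \Psi_i^\top\theta^*) \Psi_i w_i^{(k)} \\
&= \sum_{i=1}^{n} \{-\tau + \mathbb{1}(Y_i - \Psi_i^\top\theta^* < 0)\} \Psi_i w_i^{(k)}.
\end{aligned} \tag{20}
\]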

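Further, for $\epsilon = (\delta, \varrho)$ and $D_k^2$ from (16), define 

\[ D_{\epsilon,k}^2 = D_k^2(1 - \delta) - \varrho V_k^2, \qquad \xi_{\epsilon,k} = D_{\epsilon,k}^{-1} \nabla L(W^{(k)}, \theta^*), \tag{21} \]

and similarly for $\underline{\epsilon} \overset{\text{def}}{=} -\epsilon = (-\delta, -\varrho)$. The values $\delta$, $\varrho$ are assumed to be small enough 
to ensure that $D_{\epsilon,k}^2$ is positive and the value 

\[ \alpha_{\epsilon,k} \overset{\text{def}}{=} \lambda_{\max}\bigl(I_{p+1} - D_{\epsilon,k} D_{\underline{\epsilon},k}^{-2} D_{\epsilon,k}\bigr) \tag{22} \]

is small as well. Finally, define 

\[
\begin{aligned}
\mathbb{L}_{\epsilon}(W^{(k)}, \theta, \theta^*) &= (\theta - \theta^*)^\top \nabla L(W^{(k)}, \theta^*) - \tfrac{1}{2}\|D_{\epsilon,k}(\theta - \theta^*)\|^2 \\
&= \xi_{\epsilon,k}^\top D_{\epsilon,k}(\theta - \theta^*) - \tfrac{1}{2}\|D_{\epsilon,k}(\theta - \theta^*)\|^2
\end{aligned}
\]

and a similar definition for $\mathbb{L}_{\underline{\epsilon}}(W^{(k)}, \theta, \theta^*)$. 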

Theorem 6.2. Under the conditions $(ED_0)$, $(ED_1)$, $(L_0)$, it holds for all $k \le K$ and 
all $\theta \in \Theta_k(r_0)$ 

\[ \mathbb{L}_{\underline{\epsilon}}(W^{(k)}, \theta, \theta^*) - \diamondsuit_{\underline{\epsilon},k} \le L(W^{(k)}, \theta, \theta^*) \le \mathbb{L}_{\epsilon}(W^{(k)}, \theta, \theta^*) + \diamondsuit_{\epsilon,k}. \tag{23} \]

Here $\diamondsuit_{\epsilon,k}$ are random error terms which fulfill, with some $C_1(A)$ and $C_2(A)$, the 
following conditions: for any $x > 0$ with $C_1(A)x + C_2(A) \le y_c$ 


^*(^ 1 C> e , fc >C 1 (A)x + C 2 (A)(p + l)) < C(A)e" x , 

^*k _1 <>e,fer<a(A), 

where y c zs a constant of order ■ 

The sandwiching result (23) for each $k$ follows from Theorem 3.1 of Spokoiny (2011). 
It is only worth mentioning that the local sets $\Theta_k(r_0)$ are embedded: $\Theta_1(r_0) \supset \Theta_2(r_0) \supset 
\ldots \supset \Theta_K(r_0)$, so it suffices to check the bound (23) on $\Theta_1(r_0)$ for each $k \le K$. 

The majorization bound (23) yields that the maximum of the process $L(W^{(k)}, \theta, \theta^*)$ is 
also sandwiched between the maxima of $\mathbb{L}_{\underline{\epsilon}}(W^{(k)}, \theta, \theta^*)$ and of $\mathbb{L}_{\epsilon}(W^{(k)}, \theta, \theta^*)$ up to a 
small random error term. Moreover, $\mathbb{L}_{\epsilon}(W^{(k)}, \theta, \theta^*)$ is quadratic in $\theta$, and its maximum 
is given by the quadratic form $\|\xi_{\epsilon,k}\|^2/2$; similarly for $\mathbb{L}_{\underline{\epsilon}}(W^{(k)}, \theta, \theta^*)$. The next result 
presents a probabilistic bound for such quadratic forms. 

Theorem 6.3. Assume A.1 through A.5. There exist $C_1(A)$ and $C_2(A)$ such that for 
each $x$ with $C_1(A)x + C_2(A)(p + 1) \le y_c$ and $k \le K$, it holds 

\[ P_{\theta^*}\{\|\xi_{\epsilon,k}\|^2 > C_1(A)x + C_2(A)(p + 1)\} \le 2e^{-x}. \]

Furthermore, for $r > 0$ and $k \le K$, it holds 

\[ E\|\xi_{\epsilon,k}\|^{2r} \le C_r(A). \]

Consider the random set 

\[ A_2 = \bigcap_{k=1}^{K} \{\|\xi_{\epsilon,k}\| \le r_0\}. \]

Due to the bound of Theorem 6.3, the choice $r_0^2 = C_1(A)(x + \log K) + C_2(A)(p + 1)$ ensures 
that the probability of the set $A_2$ is at least $1 - 2e^{-x}$. 

Below we restrict ourselves to the set $A$ with $A = A_1 \cap A_2$. By construction 

\[ P(A) \ge 1 - 4e^{-x} \]

and on this set $\tilde\theta_k \in \Theta_k(r_0)$ and $\|\xi_{\epsilon,k}\| \le r_0$ for all $k \le K$. 

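The results of Theorems 6.2 and 6.3 have a number of important corollaries; cf. Spokoiny 
(2011). 

Corollary 1. It holds on $A$ for every $k \le K$ 

\[ \tfrac{1}{2}\|\xi_{\underline{\epsilon},k}\|^2 - \diamondsuit_{\underline{\epsilon},k} \le L(W^{(k)}, \tilde\theta_k(x), \theta^*) \le \tfrac{1}{2}\|\xi_{\epsilon,k}\|^2 + \diamondsuit_{\epsilon,k}. \tag{24} \]

Corollary 2. It holds on $A$ for every $k \le K$ 

\[
\begin{aligned}
\|D_{\epsilon,k}(\tilde\theta_k(x) - \theta^*) - \xi_{\epsilon,k}\|^2 &\le 4\diamondsuit_{\epsilon,k} + \alpha_{\epsilon,k}\|\xi_{\epsilon,k}\|^2, \\
\|D_{\epsilon,k}(\tilde\theta_k(x) - \theta^*)\| &\le 2\diamondsuit_{\epsilon,k}^{1/2} + (1 + \alpha_{\epsilon,k}^{1/2})\|\xi_{\epsilon,k}\|.
\end{aligned} \tag{25}
\]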

The result of Corollary 1 can be viewed as a non-asymptotic version of the Wilks theorem. 
It claims that twice the excess, $2L(W^{(k)}, \tilde\theta_k(x), \theta^*)$, can be approximated by the quadratic 
form $\|\xi_{\epsilon,k}\|^2$. By definition (21), each vector $\xi_{\epsilon,k}$ is the normalized score $D_{\epsilon,k}^{-1}\nabla L(W^{(k)}, \theta^*)$. 
This score is a weighted and centered sum of the Bernoulli random variables $\mathbb{1}(Y_i - \Psi_i^\top\theta^* < 0)$ 
with $P_{\theta^*}(Y_i - \Psi_i^\top\theta^* < 0) = \tau$; see (20). So its distribution under $P_{\theta^*}$ depends only on 
the design $X_1, \ldots, X_n$ and on the weights $w_i^{(k)}$. This even applies to the joint distribution 
of all the $\xi_{\epsilon,k}$ for $k = 1, \ldots, K$. This important pivotality property explains why the 
computed critical values $\mathfrak{z}_k$ are almost independent of the underlying distribution of the 
errors $Y_i - f(X_i)$. Further, by our identifiability condition A.3, $D_{\epsilon,k}^2$ is of the same 
order as the variance $V_k^2 = \operatorname{Var}\{\nabla L(W^{(k)}, \theta^*)\}$, so $\xi_{\epsilon,k}$ is nearly normal under the usual 
assumptions, and thus twice the excess is asymptotically $\chi^2_{p+1}$. 
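
The pivotality of the score can be checked in a small simulation. The sketch below is an illustration only: it treats the local constant case $p = 0$ (so $\Psi_i = 1$), uses the unnormalized score from (20), and compares two hypothetical error laws (normal and exponential), each shifted so that its $\tau$-quantile is zero. All function names are assumptions made for this example.

```python
import math
import random
from statistics import NormalDist, pvariance


def score(residuals, weights, tau):
    """Weighted, centered sum of the indicators 1(residual < 0), as in (20) with p = 0."""
    return sum(((e < 0) - tau) * w for e, w in zip(residuals, weights))


def make_normal_draw(tau):
    """N(0, 1) errors shifted so that their tau-quantile is zero."""
    shift = NormalDist().inv_cdf(tau)
    return lambda rng: rng.gauss(0.0, 1.0) - shift


def make_exponential_draw(tau):
    """Exp(1) errors shifted so that their tau-quantile is zero."""
    shift = -math.log(1.0 - tau)
    return lambda rng: rng.expovariate(1.0) - shift


def simulated_score_variance(draw, weights, tau, reps, seed):
    """Monte Carlo variance of the score under the given error law."""
    rng = random.Random(seed)
    scores = [score([draw(rng) for _ in weights], weights, tau) for _ in range(reps)]
    return pvariance(scores)


def theoretical_score_variance(weights, tau):
    """tau * (1 - tau) * sum of squared weights, the p = 0 case of (17)."""
    return tau * (1.0 - tau) * sum(w * w for w in weights)
```

Both simulated variances should be close to the value from (17), which involves only $\tau$ and the weights and not the error law.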

One can summarize the obtained general results as follows. On the set $A$ of dominating 
probability, each estimator $\tilde\theta_k(x)$ belongs to the local vicinity $\Theta_k(r_0)$, which yields the 
bounds (24), (25). Moreover, the random quantities $\diamondsuit_{\epsilon,k}$ and $\xi_{\epsilon,k}$ obey the deviation 
and moment bounds of Theorem 6.2 and Theorem 6.3. 

6.2.1 Conditions from Spokoiny (2011) 

Here we list the conditions from Spokoiny (2011) which are assumed to be fulfilled for 
each local likelihood $L(W^{(k)}, \theta)$, $k \le K$. Some value $r_0$ is assumed to be fixed for all 
conditions. It separates the zone of local quadratic approximation from the large 
deviation zone. The assumptions are stated under the true data distribution $P$. However, 
we apply the assumptions only in the case of a linear parametric structure with $f(\cdot) = f_{\theta^*}(\cdot)$. 
Define 

\[ \zeta_k(\theta) \overset{\text{def}}{=} L(W^{(k)}, \theta) - E L(W^{(k)}, \theta). \]

Also denote $\nabla\zeta_k(\theta) = \frac{d}{d\theta}\zeta_k(\theta)$. The majorization bound (23) of Theorem 6.2 is stated in 
Spokoiny (2011) under the following conditions. 
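
$(ED_0)$ There exist a positive symmetric matrix $V_k^2$ and constants $\mathfrak{g} > 0$ and $\nu_0 \ge 1$ 
such that $\operatorname{Var}\{\nabla\zeta_k(\theta^*)\} \le V_k^2$ and for all $\lambda$ with $|\lambda| \le \mathfrak{g}$, 

\[ \sup_{\gamma \in \mathbb{R}^{p+1}} \log E \exp\Bigl\{\lambda \frac{\gamma^\top \nabla\zeta_k(\theta^*)}{\|V_k\gamma\|}\Bigr\} \le \nu_0^2\lambda^2/2. \]

With this matrix $V_k$, define the local set 

\[ \Theta_k(r_0) = \{\theta : \|V_k(\theta - \theta^*)\| \le r_0\}; \]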

$(ED_1)$ For each $r \le r_0$, there exists a constant $\omega(r) \le 1/2$ such that it holds for all 
$\theta \in \Theta_k(r)$ and $|\lambda| \le \mathfrak{g}$: 

sup log£exp<^ A S < i/ A /2; 

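$(L_0)$ There are a positive matrix $D_k$ and, for each $r \le r_0$, a constant $\delta(r) \le 1/2$ 
such that it holds for all $\theta \in \Theta_k(r)$: 

\[ \Bigl|\frac{-2E L(W^{(k)}, \theta, \theta^*)}{\|D_k(\theta - \theta^*)\|^2} - 1\Bigr| \le \delta(r); \]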



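$(E_r)$ For any $r > r_0$, there exist a value $\mathfrak{g}(r) > 0$ and a constant $\nu_0$ such that for all 
$|\lambda| \le \mathfrak{g}(r)$, 

\[ \sup_{\gamma \in \mathbb{R}^{p+1}} \sup_{\theta \in \Theta_k(r)} \log E \exp\Bigl\{\lambda \frac{\gamma^\top \nabla\zeta_k(\theta)}{\|V_k\gamma\|}\Bigr\} \le \nu_0^2\lambda^2/2; \]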

$(L_r)$ For each $r \ge r_0$ and any $\theta$ with $\|V_k(\theta - \theta^*)\| = r$, 

\[ \frac{-E L(W^{(k)}, \theta, \theta^*)}{\|V_k(\theta - \theta^*)\|^2} \ge b(r) > 0. \]

All these conditions are assumed to be fulfilled for each $k \le K$. Conditions 
$(ED_0)$, $(ED_1)$, $(L_0)$ are local conditions which should be applied on the local set $\Theta_k(r_0)$, 
while $(E_r)$, $(L_r)$ are global conditions which we apply on the complement of $\Theta_k(r_0)$. 
Also, $(ED_0)$, $(ED_1)$, $(E_r)$ are smoothness or moment assumptions on the log-likelihood 
process, and the conditions $(L_0)$, $(L_r)$ ensure the identifiability properties. 

6.2.2 Proof of $(E_r)$, $(ED_0)$ and $(ED_1)$ 

Let us fix some $k \le K$. Let $N_k$ be the local sample size for the weighting scheme $W^{(k)}$; 
see (18). Let also $r_0$ be fixed in such a way that $r_0|\Psi_i| \le \rho N_k$ for all $i$ with $w_i^{(k)} > 0$, that 
is, for all $X_i$ with $|X_i - x| \le h_k$. 
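
First we check $(E_r)$. It holds by definition 

\[
\begin{aligned}
\nabla\zeta_k(\theta) &= \sum_{i=1}^{n} \Psi_i\{\mathbb{1}(Y_i - \Psi_i^\top\theta < 0) - P(Y_i - \Psi_i^\top\theta < 0)\} w_i^{(k)} \\
&= \sum_{i=1}^{n} \Psi_i \varepsilon_i(\theta) w_i^{(k)}
\end{aligned}
\]

with $\varepsilon_i(\theta) \overset{\text{def}}{=} \mathbb{1}(Y_i - \Psi_i^\top\theta < 0) - P(Y_i - \Psi_i^\top\theta < 0)$. Obviously $\mathbb{1}(Y_i - \Psi_i^\top\theta < 0)$ is a 
Bernoulli random variable with the parameter $q_i(\theta) \overset{\text{def}}{=} P(Y_i - \Psi_i^\top\theta < 0)$ and 

\[ \log E \exp\{\delta\varepsilon_i(\theta)\} = \log\{1 - q_i(\theta) + q_i(\theta)e^{\delta}\} - \delta q_i(\theta). \]

The function $g(\delta) = \log(1 - q + qe^{\delta}) - q\delta$ fulfills for any $q \le 1$ 

\[ g(0) = 0, \qquad g'(0) = 0, \qquad g''(\delta) \le q(1 - q)e^{|\delta|}. \]

This implies 

\[ \log E \exp\{\delta\varepsilon_i(\theta)\} \le q_i(\theta)\{1 - q_i(\theta)\}\nu_0^2\delta^2/2, \qquad |\delta| \le \mathfrak{g}_1, \]

for a constant $\nu_0 \ge 1$ depending on $\mathfrak{g}_1$ only. Therefore, it holds for any $\gamma \in \mathbb{R}^{p+1}$ and 
$\rho > 0$ with $\rho|\gamma^\top\Psi_i| \le \mathfrak{g}_1$ that 

\[
\begin{aligned}
\log E \exp\{\rho\gamma^\top\nabla\zeta_k(\theta)\} &= \log E \exp\Bigl\{\rho\sum_{i=1}^{n} \gamma^\top\Psi_i\varepsilon_i(\theta)w_i^{(k)}\Bigr\} \\
&= \sum_{i=1}^{n} \log E \exp\{\rho\gamma^\top\Psi_i\varepsilon_i(\theta)w_i^{(k)}\} \\
&\le \sum_{i=1}^{n} \rho^2|\gamma^\top\Psi_i w_i^{(k)}|^2 q_i(\theta)\{1 - q_i(\theta)\}\nu_0^2/2 \\
&\le \nu_0^2\rho^2\|V_k(\theta)\gamma\|^2/2,
\end{aligned}
\]

where 

\[ V_k^2(\theta) \overset{\text{def}}{=} \sum_{i=1}^{n} q_i(\theta)\{1 - q_i(\theta)\}\Psi_i\Psi_i^\top|w_i^{(k)}|^2. \]

This yields $(ED_0)$ with $V_k^2 \overset{\text{def}}{=} V_k^2(\theta^*)$ and $\mathfrak{g} = \mathfrak{g}_1 N_k^{1/2}$; see (18). Furthermore, the linear 
parametric assumption $f = f_{\theta^*}$ yields $q_i(\theta^*) = \tau$ and hence 

\[ V_k^2(\theta^*) = 4\tau(1 - \tau)\mathcal{V}_k^2 \]

for 

\[ \mathcal{V}_k^2 \overset{\text{def}}{=} \frac{1}{4}\sum_{i=1}^{n} \Psi_i\Psi_i^\top|w_i^{(k)}|^2. \]

For any $\theta \in \Theta$, it obviously holds $V_k^2(\theta) \le \mathcal{V}_k^2 \le \{4\tau(1 - \tau)\}^{-1}V_k^2$, and thus $(E_r)$ is 
fulfilled with $\mathfrak{g}^2(r) = 4\tau(1 - \tau)N_k\mathfrak{g}_1^2$. 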
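
The elementary bound on the log moment generating function of a centered Bernoulli variable, which drives the check above, can be verified numerically. The sketch below assumes a bound of the form $g''(\delta) \le q(1-q)e^{|\delta|}$, which in turn gives $g(\delta) \le q(1-q)e^{|\delta|}\delta^2/2$ by Taylor expansion around zero; the exact exponent in the source is not legible, so this form is an assumption. The function names are hypothetical.

```python
import math


def log_mgf(q, delta):
    """g(delta) = log(1 - q + q * exp(delta)) - q * delta for a centered Bernoulli(q)."""
    return math.log(1.0 - q + q * math.exp(delta)) - q * delta


def log_mgf_second_derivative(q, delta):
    """g''(delta) = q (1 - q) e^delta / (1 - q + q e^delta)^2."""
    e = math.exp(delta)
    return q * (1.0 - q) * e / (1.0 - q + q * e) ** 2


def quadratic_bound(q, delta):
    """Assumed majorant q (1 - q) e^{|delta|} delta^2 / 2 of g(delta)."""
    return q * (1.0 - q) * math.exp(abs(delta)) * delta * delta / 2.0


def bound_holds_on_grid(qs, deltas, tol=1e-12):
    """Check both the curvature bound and the resulting quadratic bound on a grid."""
    for q in qs:
        for d in deltas:
            if log_mgf_second_derivative(q, d) > q * (1.0 - q) * math.exp(abs(d)) + tol:
                return False
            if log_mgf(q, d) > quadratic_bound(q, d) + tol:
                return False
    return True
```

On a bounded range $|\delta| \le \mathfrak{g}_1$ the factor $e^{|\delta|}$ is at most $e^{\mathfrak{g}_1}$, which is one way to obtain a constant $\nu_0^2$ depending on $\mathfrak{g}_1$ only.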

Next we check the local condition (ED\) . For r < ro and £ 0fc(r) , it holds 

n 

VC fe (0) - VC fe (0*) = - e,(^)}^ (fc) - 

i=l 

Similarly to the above, the identity E{si(6) — = qi(6) — qi(0*) implies 

logi<;exp[A7 T {VC(0)- VC(0*)}] 

< 2z. 2 A 2 ||F fc7 || 2 max| (/l (0) - qi (0*)\ H{w\ k) > 0) 

i<n 

< Lo(ry o X 2 \\V kl f/2 

with 

w(r) d = 4 max sup ( ft (0) - ft(0*)} } > 0). 

i< n ee© fc (r ) 

Further, for any 6 G <9 fc (r) , it holds |!pJ(0 - 9*)\wf ) < r/iV fe 

- %(0*)| !(«;<* } > 0) < C|<^(0 - !(«;{* } > 0) 

< CN k l/2 \\V k {0 - 6*)\\ < CN k 1/2 r, 

and (EDi) holds with g(r) = N~ 1/2 t . 
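6.2.3 The $(L_0)$ and $(L_r)$ conditions 

These identifiability conditions will be checked under the measure $P_{\theta^*}$ corresponding to 
the linear quantile function $f(\cdot) = f_{\theta^*}(\cdot)$. It holds 

\[ \nabla E_{\theta^*}L(W^{(k)}, \theta) = -\sum_{i=1}^{n} \Psi_i\{\tau - P(Y_i - \Psi_i^\top\theta < 0)\} w_i^{(k)} \]

and 

\[ -\nabla^2 E_{\theta^*}L(W^{(k)}, \theta) = \sum_{i=1}^{n} \Psi_i\Psi_i^\top \ell_i(\Psi_i^\top(\theta - \theta^*)) w_i^{(k)} = D_k^2(\theta). \]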
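
Now the Taylor expansion of $-E_{\theta^*}L(W^{(k)}, \theta)$ at the extreme point $\theta = \theta^*$ implies 

\[
\begin{aligned}
E_{\theta^*}L(W^{(k)}, \theta) - E_{\theta^*}L(W^{(k)}, \theta^*) &= -\frac{1}{2}\sum_{i=1}^{n} |\Psi_i^\top(\theta - \theta^*)|^2 \ell_i(\Psi_i^\top(\theta^{\circ} - \theta^*)) w_i^{(k)} \\
&= -\frac{1}{2}(\theta - \theta^*)^\top D_k^2(\theta^{\circ})(\theta - \theta^*)
\end{aligned}
\]

for some $\theta^{\circ} \in [\theta, \theta^*]$. Further, for any $\theta \in \Theta_k(r_0)$, it holds by A.4 

\[ \Bigl|\frac{\ell_i(\Psi_i^\top(\theta - \theta^*))}{\ell_i(0)} - 1\Bigr| \mathbb{1}(w_i^{(k)} > 0) \le \delta, \]

and $(L_0)$ follows. The global identifiability condition $(L_r)$ is fulfilled if $r^2 \ge C_1(x + p + 1)$ 
for some fixed constant $C_1$; see Spokoiny (2011), Section 5.3, for more details. 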

6.3 Theorem for critical values 

The theorem below provides an upper bound for the critical values $\mathfrak{z}_k$ constructed in 
Section 2.5. To avoid technical burdens, we restrict the analysis to the random set $A$ and 
discard the large deviation probability part on its complement. The notation $P'(B)$ for 
a set $B$ means $P(B \cap A)$. 

Theorem 6.4. Suppose that $r > 0$, $\alpha > 0$. There exist constants $a_0$, $a_1$ such that the 
propagation condition is fulfilled with the choice 

\[ \mathfrak{z}_k = a_0 + \log(\alpha^{-1}) + a_1 r(K - k) + r\log(p + 1). \tag{26} \]

Proof. First we bound the quantity L(W^ h \ 9 k {x), 9g(x)) on the random set A = A\C\A2 ■ 
The majorization (23) and its corollary (24) yield on A with U£ k = f D^ k (9^{x) — 9*) 

L(W^,0 k (x),9 e (x)) = L(wW,d k (x),0*) -L(W (k \0 e (x),9*). 

< \U e ,kf -^e{W^ k \9 e (x),9*) +0 £ik 

= \Ue,kf ~ u Jk€e,k + \\\ u £k\\ 2 + 20 e ,fe 



|2 , ||„. it 2 



< ^(\\L,kW + \M\) +20e,k 



<Ue4 +\\vtk\r + 20e,k, (27) 

where we used the fact that ||£ e ,fc|| < ||£ e ,fcll • h is not difficult to see that 

\\ u ek\\ 2 = \\De,kDe} D e,e(9£ - 9*)\\ 2 < ||-De,fc-D~£-De,fc||oo || £>e,£ {9 1 ~ 9*) || 2 . 
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By construction D\ k <D\<D 2 ek and the definition (22) implies by a e>k < 1/2 

Dl k <(l-a e , k y l Dl k <2Dl k . 

Now it follows from condition A. 2 that 

2 2 , f 2 /uo~ £ , A;>^, 

ll^fc^i^fclloo < 2\\D k DfD k \\ 00 < { (28) 

[2u e ~ k , k<L 

Corollary 2 implies 

\\D £/ (6 e (x) - 9*)\\ < + (1 + a 1 JJ)U e:e \\ < 20'JJ + 2||^||. (29) 

We also use that E e *\\^ k \\ 2r < (p + l) r C r (A) for all k < K . Now it holds from (27), 
(28), and (29) for k > £ 

E' e «L r (w( k \e k (x)Mx)) < E' e , [||^, fc || 2 + 8u fc ^(^ 2 + ||^||) 2 + 20 e , fc ] r 

< C(A)(p + l)V' (/C ~' ) - (3°) 
Similarly one can show that for k < £ by u < 1 

E' ,L r (W^\e k (x)M^)) < K* [Ue,kf + 8 (OeJ + ll^ll) 2 + 2 O e ,fc] r 

< c(A)(p+iy. 

Also by Theorem 6.3 for x > 

P g .{L(wW,d k (x),d e (x)) >C\(p + l) + C 2 x} <2e-\ (31) 
These bounds can be used to check that the critical values $\mathfrak{z}_k$ selected in the form 
(26) ensure the propagation condition in (11). Consider the random set $B_\ell \overset{\text{def}}{=} \{\hat k(x) = \ell\}$. 
By definition of $\hat k$, when $B_\ell$ happens, at least one of the estimators $\tilde\theta_m(x)$ is not 
accepted, that is, 

\[ B_\ell \subseteq \bigcup_{m=1}^{\ell} \{L(W^{(m)}, \tilde\theta_m(x), \tilde\theta_{\ell+1}(x)) > \mathfrak{z}_m\}. \]

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The bounds (30) and (31) yield by the Cauchy-Schwarz inequality 
E' e «U{W^,e k {x)Mx)) 

< £ [E' e ^ (W<*> , G k (x) , 9<{x))] 1/2 [P' e * (B,)] 1/2 
1=1 

<C(A)( P+ ir^u- 2 ^[P' e ^e)} 1/2 
l=\ 

k / i 

£=2 \m=l 

Fix Co > log(uQ 1 ) and consider i m = C\(p + 1) + C2X m with x m = 2cor(K — m) + 2x 
for some x. Then (31) implies 

E' * [L r (W^ k \e k (x),d k (x))} 

<c(A)( P+ irf2 u o 2r{K - e) (E 2e ~ Xm 

1=2 ^m=l 

if 

< C(A)(p+ l) 2r e- x ^exp[-2r(K - ^){c - log(l/u )}] 

e=2 

< C(A)(p+l) 2r e" x 

and the bound (11) follows with $x = \log(1/\alpha) + r\log(p + 1) + a_0$ for a proper $a_0$. □ 

6.4 Propagation Property and Stability 

The oracle result is a consequence of two properties of the procedure: propagation under 
homogeneity and stability. The first one means that, with high probability, the algorithm 
does not terminate for $k < k^*$ (no false alarm). The stability property ensures that 
the estimation quality will not essentially deteriorate in the steps "after propagation" for 
$k > k^*$. 

By construction, the procedure described in Section 2 provides the prescribed performance 
if the true quantile function $f(\cdot)$ follows the parametric model: at any intermediate 
step $k \le K$ the non-adaptive estimator $\tilde\theta_k(x)$ and the adaptive estimator $\hat\theta_k(x)$ coincide 
with high probability, so that $E_{\theta^*}L^r(W^{(k)}, \tilde\theta_k(x), \hat\theta_k(x))$ is small. The next theorem 
claims a similar performance of the $k$-step estimator $\hat\theta_k(x)$ under the true nonparametric 
model $f(\cdot)$; however, the propagation property is only guaranteed for $k \le k^*$, that is, 
while the SMB assumption is fulfilled. 

Theorem 6.5. Assume $\Delta_{k^*}(\theta) \le \Delta$ for some $k^*$. Then for any $k \le k^*$ 

\[ E\log\{1 + L^r(W^{(k)}, \tilde\theta_k(x), \hat\theta_k(x))/\mathfrak{R}_r\} \le \Delta + \alpha. \tag{32} \]

The bound (32) can be derived from the next general result; see Spokoiny (2009). 

Lemma 2. Let $P$, $P_0$ be two measures such that $E\log(dP/dP_0) \le \Delta < \infty$. For any 
random variable $Z$ with $E_0 Z < \infty$, it holds $E\log(1 + Z) \le \Delta + E_0 Z$. 
The propagation result (32) explains the behavior of the procedure for the first $k^*$ 
steps. In addition, we need a stability property which makes sure that at the further 
steps of the algorithm, for $k > k^*$, the quality of the obtained adaptive estimator $\hat\theta_k(x)$ 
will not significantly deteriorate. The stability property can be stated as follows. 

Theorem 6.6. The adaptive estimator $\hat\theta(x)$ fulfills 

\[ L(W^{(k^*)}, \tilde\theta_{k^*}(x), \hat\theta(x)) \, \mathbb{1}\{\hat k(x) > k^*\} \le \mathfrak{z}_{k^*}. \tag{33} \]

Due to (33), on the set $\{\hat k(x) > k^*\}$, the adaptive estimator $\hat\theta(x)$ belongs to the 
confidence set $\mathcal{E}_{k^*}(\mathfrak{z}_{k^*})$ of the oracle estimator $\tilde\theta_{k^*}(x)$. This assertion follows from the 
setup of our procedure because the estimate $\hat\theta(x) = \tilde\theta_{\hat k}(x)$ is accepted. If $\hat k(x) > k^*$, it 
should be consistent with $\tilde\theta_{k^*}(x)$, and thus it belongs to the confidence set of $\tilde\theta_{k^*}(x)$. 

6.5 Proof of the "oracle" property 

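The propagation and stability results yield 

\[
\begin{aligned}
& E\log\{1 + \mathfrak{R}_r^{-1}L^r(W^{(k^*)}, \tilde\theta_{k^*}(x), \hat\theta(x))\} \\
&\quad = E\bigl[\log\{1 + \mathfrak{R}_r^{-1}L^r(W^{(k^*)}, \tilde\theta_{k^*}(x), \hat\theta(x))\} \, \mathbb{1}(\hat k \le k^*)\bigr] \\
&\qquad + E\bigl[\log\{1 + \mathfrak{R}_r^{-1}L^r(W^{(k^*)}, \tilde\theta_{k^*}(x), \hat\theta(x))\} \, \mathbb{1}(\hat k > k^*)\bigr] \\
&\quad \le \Delta + E_{\theta^*}\bigl[\mathfrak{R}_r^{-1}L^r(W^{(k^*)}, \tilde\theta_{k^*}(x), \hat\theta_{k^*}(x))\bigr] \\
&\qquad + E\log\bigl[1 + \mathfrak{R}_r^{-1}L^r(W^{(k^*)}, \tilde\theta_{k^*}(x), \hat\theta(x)) \, \mathbb{1}(\hat k > k^*)\bigr] \\
&\quad \le \Delta + \alpha + \log(1 + \mathfrak{z}_{k^*}^r/\mathfrak{R}_r).
\end{aligned}
\]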